diff --git a/locale/es/CODE_OF_CONDUCT.md b/locale/es/CODE_OF_CONDUCT.md new file mode 100644 index 000000000..a820b8df5 --- /dev/null +++ b/locale/es/CODE_OF_CONDUCT.md @@ -0,0 +1,12 @@ +--- +title: Contributor Code of Conduct +--- + +As contributors and maintainers of this project, +we pledge to follow [The Carpentries Code of Conduct][coc]. + +Instances of abusive, harassing, or otherwise unacceptable behavior +may be reported by following our [reporting guidelines][coc-reporting]. + +[coc-reporting]: https://docs.carpentries.org/topic_folders/policies/incident-reporting.html +[coc]: https://docs.carpentries.org/topic_folders/policies/code-of-conduct.html diff --git a/locale/es/CONTRIBUTING.md b/locale/es/CONTRIBUTING.md new file mode 100644 index 000000000..d29e890c5 --- /dev/null +++ b/locale/es/CONTRIBUTING.md @@ -0,0 +1,122 @@ +## Contributing + +[The Carpentries][cp-site] ([Software Carpentry][swc-site], [Data +Carpentry][dc-site], and [Library Carpentry][lc-site]) are open source +projects, and we welcome contributions of all kinds: new lessons, fixes to +existing material, bug reports, and reviews of proposed changes are all +welcome. + +### Contributor Agreement + +By contributing, you agree that we may redistribute your work under our +license. In exchange, we will address your issues and/or assess +your change proposal as promptly as we can, and help you become a member of our +community. Everyone involved in [The Carpentries][cp-site] agrees to abide by +our [code of conduct](CODE_OF_CONDUCT.md). + +### How to Contribute + +The easiest way to get started is to file an issue to tell us about a spelling +mistake, some awkward wording, or a factual error. This is a good way to +introduce yourself and to meet some of our community members. + +1. If you do not have a [GitHub][github] account, you can [send us comments by + email][contact]. However, we will be able to respond more quickly if you use + one of the other methods described below. + +2.
If you have a [GitHub][github] account, or are willing to [create + one][github-join], but do not know how to use Git, you can report problems + or suggest improvements by [creating an issue][repo-issues]. This allows us + to assign the item to someone and to respond to it in a threaded discussion. + +3. If you are comfortable with Git, and would like to add or change material, + you can submit a pull request (PR). Instructions for doing this are + [included below](#using-github). For inspiration about changes that need to + be made, check out the [list of open issues][issues] across the Carpentries. + +Note: if you want to build the website locally, please refer to [The Workbench +documentation][template-doc]. + +### Where to Contribute + +1. If you wish to change this lesson, add issues and pull requests here. +2. If you wish to change the template used for workshop websites, please refer + to [The Workbench documentation][template-doc]. + +### What to Contribute + +There are many ways to contribute, from writing new exercises and improving +existing ones to updating or filling in the documentation and submitting [bug +reports][issues] about things that do not work, are not clear, or are missing. +If you are looking for ideas, please see [the list of issues for this +repository][repo-issues], or the issues for [Data Carpentry][dc-issues], +[Library Carpentry][lc-issues], and [Software Carpentry][swc-issues] projects. + +Comments on issues and reviews of pull requests are just as welcome: we are +smarter together than we are on our own. **Reviews from novices and newcomers +are particularly valuable**: it's easy for people who have been using these +lessons for a while to forget how impenetrable some of this material can be, so +fresh eyes are always welcome. + +### What _Not_ to Contribute + +Our lessons already contain more material than we can cover in a typical +workshop, so we are usually _not_ looking for more concepts or tools to add to +them. 
As a rule, if you want to introduce a new idea, you must (a) estimate how +long it will take to teach and (b) explain what you would take out to make room +for it. The first encourages contributors to be honest about requirements; the +second, to think hard about priorities. + +We are also not looking for exercises or other material that only run on one +platform. Our workshops typically contain a mixture of Windows, macOS, and +Linux users; in order to be usable, our lessons must run equally well on all +three. + +### Using GitHub + +If you choose to contribute via GitHub, you may want to look at [How to +Contribute to an Open Source Project on GitHub][how-contribute]. In brief, we +use [GitHub flow][github-flow] to manage changes: + +1. Create a new branch in your desktop copy of this repository for each + significant change. +2. Commit the change in that branch. +3. Push that branch to your fork of this repository on GitHub. +4. Submit a pull request from that branch to the [upstream repository][repo]. +5. If you receive feedback, make changes on your desktop and push to your + branch on GitHub: the pull request will update automatically. + +NB: The published copy of the lesson is usually in the `main` branch. + +Each lesson has a team of maintainers who review issues and pull requests or +encourage others to do so. The maintainers are community volunteers, and have +final say over what gets merged into the lesson. + +### Other Resources + +The Carpentries is a global organisation with volunteers and learners all over +the world. We share values of inclusivity and a passion for sharing knowledge, +teaching and learning. There are several ways to connect with The Carpentries +community listed at \ including via social +media, slack, newsletters, and email lists. You can also [reach us by +email][contact]. 
+ +[repo]: https://github.com/swcarpentry/r-novice-gapminder +[repo-issues]: https://github.com/swcarpentry/r-novice-gapminder/issues +[contact]: mailto:team@carpentries.org +[cp-site]: https://carpentries.org/ +[dc-issues]: https://github.com/issues?q=user%3Adatacarpentry +[dc-lessons]: https://datacarpentry.org/lessons/ +[dc-site]: https://datacarpentry.org/ +[discuss-list]: https://lists.software-carpentry.org/listinfo/discuss +[github]: https://github.com +[github-flow]: https://guides.github.com/introduction/flow/ +[github-join]: https://github.com/join +[how-contribute]: https://egghead.io/courses/how-to-contribute-to-an-open-source-project-on-github +[issues]: https://carpentries.org/help-wanted-issues/ +[lc-issues]: https://github.com/issues?q=user%3ALibraryCarpentry +[swc-issues]: https://github.com/issues?q=user%3Aswcarpentry +[swc-lessons]: https://software-carpentry.org/lessons/ +[swc-site]: https://software-carpentry.org/ +[lc-site]: https://librarycarpentry.org/ +[template-doc]: https://carpentries.github.io/workbench/ diff --git a/locale/es/LICENSE.md b/locale/es/LICENSE.md new file mode 100644 index 000000000..513ad8f83 --- /dev/null +++ b/locale/es/LICENSE.md @@ -0,0 +1,79 @@ +--- +title: Licenses +--- + +## Instructional Material + +All Carpentries (Software Carpentry, Data Carpentry, and Library Carpentry) +instructional material is made available under the [Creative Commons +Attribution license][cc-by-human]. The following is a human-readable summary of +(and not a substitute for) the [full legal text of the CC BY 4.0 +license][cc-by-legal]. + +You are free: + +- to **Share**---copy and redistribute the material in any medium or format +- to **Adapt**---remix, transform, and build upon the material + +for any purpose, even commercially. + +The licensor cannot revoke these freedoms as long as you follow the license +terms. 
+ +Under the following terms: + +- **Attribution**---You must give appropriate credit (mentioning that your work + is derived from work that is Copyright (c) The Carpentries and, where + practical, linking to \), provide a [link to the + license][cc-by-human], and indicate if changes were made. You may do so in + any reasonable manner, but not in any way that suggests the licensor endorses + you or your use. + +- **No additional restrictions**---You may not apply legal terms or + technological measures that legally restrict others from doing anything the + license permits. With the understanding that: + +Notices: + +- You do not have to comply with the license for elements of the material in + the public domain or where your use is permitted by an applicable exception + or limitation. +- No warranties are given. The license may not give you all of the permissions + necessary for your intended use. For example, other rights such as publicity, + privacy, or moral rights may limit how you use the material. + +## Software + +Except where otherwise noted, the example programs and other software provided +by The Carpentries are made available under the [OSI][osi]-approved [MIT +license][mit-license]. + +Permission is hereby granted, free of charge, to any person obtaining a copy of +this software and associated documentation files (the "Software"), to deal in +the Software without restriction, including without limitation the rights to +use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies +of the Software, and to permit persons to whom the Software is furnished to do +so, subject to the following conditions: + +The above copyright notice and this permission notice shall be included in all +copies or substantial portions of the Software. + +THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE +AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +SOFTWARE. + +## Trademark + +"The Carpentries", "Software Carpentry", "Data Carpentry", and "Library +Carpentry" and their respective logos are registered trademarks of [Community +Initiatives][ci]. + +[cc-by-human]: https://creativecommons.org/licenses/by/4.0/ +[cc-by-legal]: https://creativecommons.org/licenses/by/4.0/legalcode +[mit-license]: https://opensource.org/licenses/mit-license.html +[ci]: https://communityin.org/ +[osi]: https://opensource.org diff --git a/locale/es/README.md b/locale/es/README.md new file mode 100644 index 000000000..19341cede --- /dev/null +++ b/locale/es/README.md @@ -0,0 +1,6 @@ +# Internationalisation hub repository for Software Carpentry R for Reproducible Scientific Analysis + +An introduction to R for non-programmers using the [Gapminder][gapminder] data. +Please see [https://swcarpentry.github.io/r-novice-gapminder](https://swcarpentry.github.io/r-novice-gapminder) for a rendered version of this material in English. + +More info to follow. diff --git a/locale/es/config.yaml b/locale/es/config.yaml new file mode 100644 index 000000000..e8310df81 --- /dev/null +++ b/locale/es/config.yaml @@ -0,0 +1,71 @@ +#------------------------------------------------------------ +#Values for this lesson. +#------------------------------------------------------------ +#Which carpentry is this (swc, dc, lc, or cp)? +#swc: Software Carpentry +#dc: Data Carpentry +#lc: Library Carpentry +#cp: Carpentries (to use for instructor training for instance) +#incubator: The Carpentries Incubator +carpentry: 'swc' +#Overall title for pages. 
+title: 'R for Reproducible Scientific Analysis' +#Date the lesson was created (YYYY-MM-DD, this is empty by default) +created: '2015-04-18' +#Comma-separated list of keywords for the lesson +keywords: 'software, data, lesson, The Carpentries' +#Life cycle stage of the lesson +#possible values: pre-alpha, alpha, beta, stable +life_cycle: 'stable' +#License of the lesson materials (recommended CC-BY 4.0) +license: 'CC-BY 4.0' +#Link to the source repository for this lesson +source: 'https://github.com/swcarpentry/r-novice-gapminder' +#Default branch of your lesson +branch: 'main' +#Who to contact if there are any issues +contact: 'team@carpentries.org' +#Navigation ------------------------------------------------ +#Use the following menu items to specify the order of +#individual pages in each dropdown section. Leave blank to +#include all pages in the folder. +#Example ------------- +#episodes: +#- introduction.md +#- first-steps.md +#learners: +#- setup.md +#instructors: +#- instructor-notes.md +#profiles: +#- one-learner.md +#- another-learner.md +#Order of episodes in your lesson +episodes: + - 01-rstudio-intro.Rmd + - 02-project-intro.Rmd + - 03-seeking-help.Rmd + - 04-data-structures-part1.Rmd + - 05-data-structures-part2.Rmd + - 06-data-subsetting.Rmd + - 07-control-flow.Rmd + - 08-plot-ggplot2.Rmd + - 09-vectorization.Rmd + - 10-functions.Rmd + - 11-writing-data.Rmd + - 12-dplyr.Rmd + - 13-tidyr.Rmd + - 14-knitr-markdown.Rmd + - 15-wrap-up.Rmd +#Information for Learners +learners: +#Information for Instructors +instructors: +#Learner Profiles +profiles: +#Customisation --------------------------------------------- +#This space below is where custom yaml items (e.g. 
pinning +#sandpaper and varnish versions) should live +url: 'https://swcarpentry.github.io/r-novice-gapminder' +analytics: carpentries +lang: en diff --git a/locale/es/episodes/01-rstudio-intro.Rmd b/locale/es/episodes/01-rstudio-intro.Rmd new file mode 100644 index 000000000..3949440a1 --- /dev/null +++ b/locale/es/episodes/01-rstudio-intro.Rmd @@ -0,0 +1,722 @@ +--- +title: Introduction to R and RStudio +teaching: 45 +exercises: 10 +source: Rmd +--- + +::::::::::::::::::::::::::::::::::::::: objectives + +- Describe the purpose and use of each pane in RStudio +- Locate buttons and options in RStudio +- Define a variable +- Assign data to a variable +- Manage a workspace in an interactive R session +- Use mathematical and comparison operators +- Call functions +- Manage packages + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- How to find your way around RStudio? +- How to interact with R? +- How to manage your environment? +- How to install packages? + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +```{r, include=FALSE} +``` + +## Before Starting The Workshop + +Please ensure you have the latest version of R and RStudio installed on your machine. This is important, as some packages used in the workshop may not install correctly (or at all) if R is not up to date. + +- [Download and install the latest version of R here](https://www.r-project.org/) +- [Download and install RStudio here](https://www.rstudio.com/products/rstudio/download/#download) + +## Why use R and R studio? + +Welcome to the R portion of the Software Carpentry workshop! + +Science is a multi-step process: once you've designed an experiment and collected +data, the real fun begins with analysis! Throughout this lesson, we're going to teach you some of the fundamentals of the R language as well as some best practices for organizing code for scientific projects that will make your life easier. 
+ +Although we could use a spreadsheet in Microsoft Excel or Google Sheets to analyze our data, these tools are limited in their flexibility and accessibility. Critically, they also make it difficult to share the steps used to explore and change the raw data, and sharing those steps is key to ["reproducible" research](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003285). + +Therefore, this lesson will teach you how to begin exploring your data using R and RStudio. The R program is available for Windows, Mac, and Linux operating systems, and is freely available from the site where you downloaded it above. To run R, all you need is the R program. + +However, to make using R easier, we will use the program RStudio, which we also downloaded above. RStudio is a free, open-source Integrated Development +Environment (IDE). It provides a built-in editor, works on all platforms (including +on servers) and provides many advantages such as integration with version +control and project management. + +## Overview + +We will begin with raw data, perform exploratory analyses, and learn how to plot results graphically. This example starts with a dataset from [gapminder.org](https://www.gapminder.org) containing population information for many +countries through time. Can you read the data into R? Can you plot the population for +Senegal? Can you calculate the average income for countries on the continent of Asia? +By the end of these lessons you will be able to do things like plot the populations +for all of these countries in under a minute! + +**Basic layout** + +When you first open RStudio, you will be greeted by three panels: + +- The interactive R console/Terminal (entire left) +- Environment/History/Connections (tabbed in upper right) +- Files/Plots/Packages/Help/Viewer (tabbed in lower right) + +![](fig/01-rstudio.png){alt='RStudio layout'} + +Once you open files, such as R scripts, an editor panel will also open +in the top left.
+ +![](fig/01-rstudio-script.png){alt='RStudio layout with .R file open'} + +::::::::::::::::::::::::::::::::::::::::: callout + +## R scripts + +Any commands that you write in the R console can be saved to a file +to be run again. Files containing R code to be run in this way are +called R scripts. R scripts have `.R` at the end of their names to +let you know what they are. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Workflow within RStudio + +There are two main ways one can work within RStudio: + +1. Test and play within the interactive R console then copy code into + a .R file to run later. + +- This works well when doing small tests and initially starting off. +- It quickly becomes laborious. + +2. Start writing in a .R file and use RStudio's shortcut keys for the Run command + to push the current line, selected lines or modified lines to the + interactive R console. + +- This is a great way to start; all your code is saved for later +- You will be able to run the file you create from within RStudio + or using R's `source()` function. + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: Running segments of your code + +RStudio offers you great flexibility in running code from within the editor +window. There are buttons, menu choices, and keyboard shortcuts. To run the +current line, you can + +1. click on the `Run` button above the editor panel, or +2. select "Run Lines" from the "Code" menu, or +3. hit Ctrl\+Return in Windows or Linux + or \+Return on OS X. + (This shortcut can also be seen by hovering + the mouse over the button). To run a block of code, select it and then `Run`. + If you have modified a line of code within a block of code you have just run, + there is no need to reselect the section and `Run`; you can use the next button + along, `Re-run the previous region`. This will run the previous code block + including the modifications you have made.
+ +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Introduction to R + +Much of your time in R will be spent in the R interactive +console. This is where you will run all of your code, and can be a +useful environment to try out ideas before adding them to an R script +file. This console in RStudio is the same as the one you would get if +you typed in `R` in your command-line environment. + +The first thing you will see in the R interactive session is a bunch +of information, followed by a ">" and a blinking cursor. In many ways +this is similar to the shell environment you learned about during the +shell lessons: it operates on the same idea of a "Read, evaluate, +print loop": you type in commands, R tries to execute them, and then +returns a result. + +## Using R as a calculator + +The simplest thing you could do with R is to do arithmetic: + +```{r} +1 + 100 +``` + +And R will print out the answer, with a preceding "[1]". [1] is the index of +the first element of the line being printed in the console. For more information +on indexing vectors, see [Episode 6: Subsetting Data](https://swcarpentry.github.io/r-novice-gapminder/06-data-subsetting/index.html). + +If you type in an incomplete command, R will wait for you to +complete it. If you are familiar with Unix Shell's bash, you may recognize this behavior from bash. + +```r +> 1 + +``` + +```output ++ +``` + +Any time you hit return and the R session shows a "+" instead of a ">", it +means it's waiting for you to complete the command. If you want to cancel +a command you can hit Esc and RStudio will give you back the ">" prompt. + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: Canceling commands + +If you're using R from the command line instead of from within RStudio, +you need to use Ctrl\+C instead of Esc +to cancel the command. This applies to Mac users as well! 
+ +Canceling a command isn't only useful for killing incomplete commands: +you can also use it to tell R to stop running code (for example if it's +taking much longer than you expect), or to get rid of the code you're +currently writing. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +When using R as a calculator, the order of operations is the same as you +would have learned back in school. + +From highest to lowest precedence: + +- Parentheses: `(`, `)` +- Exponents: `^` or `**` +- Multiply: `*` +- Divide: `/` +- Add: `+` +- Subtract: `-` + +```{r} +3 + 5 * 2 +``` + +Use parentheses to group operations in order to force the order of +evaluation if it differs from the default, or to make clear what you +intend. + +```{r} +(3 + 5) * 2 +``` + +This can get unwieldy when not needed, but clarifies your intentions. +Remember that others may later read your code. + +```{r, eval=FALSE} +(3 + (5 * (2 ^ 2))) # hard to read +3 + 5 * 2 ^ 2 # clear, if you remember the rules +3 + 5 * (2 ^ 2) # if you forget some rules, this might help +``` + +The text after each line of code is called a +"comment". Anything that follows after the hash (or octothorpe) symbol +`#` is ignored by R when it executes code. + +Really small or large numbers get a scientific notation: + +```{r} +2/10000 +``` + +Which is shorthand for "multiplied by `10^XX`". So `2e-4` +is shorthand for `2 * 10^(-4)`. + +You can write numbers in scientific notation too: + +```{r} +5e3 # Note the lack of minus here +``` + +## Mathematical functions + +R has many built in mathematical functions. To call a function, +we can type its name, followed by open and closing parentheses. +Functions take arguments as inputs, anything we type inside the parentheses of a function is considered an argument. Depending on the function, the number of arguments can vary from none to multiple. 
For example: + +```{r, eval=FALSE} +getwd() #returns an absolute filepath +``` + +doesn't require an argument, whereas for the next set of mathematical functions we will need to supply the function a value in order to compute the result. + +```{r} +sin(1) # trigonometry functions +``` + +```{r} +log(1) # natural logarithm +``` + +```{r} +log10(10) # base-10 logarithm +``` + +```{r} +exp(0.5) # e^(1/2) +``` + +Don't worry about trying to remember every function in R. You +can look them up on Google, or if you can remember the +start of the function's name, use the tab completion in RStudio. + +This is one advantage that RStudio has over R on its own, it +has auto-completion abilities that allow you to more easily +look up functions, their arguments, and the values that they +take. + +Typing a `?` before the name of a command will open the help page +for that command. When using RStudio, this will open the 'Help' pane; +if using R in the terminal, the help page will open in your browser. +The help page will include a detailed description of the command and +how it works. Scrolling to the bottom of the help page will usually +show a collection of code examples which illustrate command usage. +We'll go through an example later. + +## Comparing things + +We can also do comparisons in R: + +```{r} +1 == 1 # equality (note two equals signs, read as "is equal to") +``` + +```{r} +1 != 2 # inequality (read as "is not equal to") +``` + +```{r} +1 < 2 # less than +``` + +```{r} +1 <= 1 # less than or equal to +``` + +```{r} +1 > 0 # greater than +``` + +```{r} +1 >= -9 # greater than or equal to +``` + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: Comparing Numbers + +A word of warning about comparing numbers: you should +never use `==` to compare two numbers unless they are +integers (a data type which can specifically represent +only whole numbers). 
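As a minimal sketch of why this warning matters, the snippet below builds two decimal values that print the same but are not exactly equal. The names `a` and `b` are purely illustrative and not part of the lesson. One common base R alternative to `==` is `all.equal()`, which compares within a small tolerance; wrapping it in `isTRUE()` gives a plain `TRUE` or `FALSE`.

```r
# Illustrative values only: `a` and `b` are hypothetical names
a <- 0.1 + 0.2
b <- 0.3

print(a)                  # displays 0.3, just like b
a == b                    # FALSE: the stored binary values differ very slightly
isTRUE(all.equal(a, b))   # TRUE: equal within all.equal()'s default tolerance
```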
+ +Computers may only represent decimal numbers with a +certain degree of precision, so two numbers which look +the same when printed out by R, may actually have +different underlying representations and therefore be +different by a small margin of error (called Machine +numeric tolerance). + +Instead you should use the `all.equal` function. + +Further reading: [http://floating-point-gui.de/](https://floating-point-gui.de/) + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Variables and assignment + +We can store values in variables using the assignment operator `<-`, like this: + +```{r} +x <- 1/40 +``` + +Notice that assignment does not print a value. Instead, we stored it for later +in something called a **variable**. `x` now contains the **value** `0.025`: + +```{r} +x +``` + +More precisely, the stored value is a _decimal approximation_ of +this fraction called a [floating point number](https://en.wikipedia.org/wiki/Floating_point). + +Look for the `Environment` tab in the top right panel of RStudio, and you will see that `x` and its value +have appeared. Our variable `x` can be used in place of a number in any calculation that expects a number: + +```{r} +log(x) +``` + +Notice also that variables can be reassigned: + +```{r} +x <- 100 +``` + +`x` used to contain the value 0.025 and now it has the value 100. + +Assignment values can contain the variable being assigned to: + +```{r} +x <- x + 1 #notice how RStudio updates its description of x on the top right tab +y <- x * 2 +``` + +The right hand side of the assignment can be any valid R expression. +The right hand side is _fully evaluated_ before the assignment occurs. + +Variable names can contain letters, numbers, underscores and periods but no spaces. They +must start with a letter or a period followed by a letter (they cannot start with a number nor an underscore). +Variables beginning with a period are hidden variables. 
+Different people use different conventions for long variable names, these include + +- periods.between.words +- underscores\_between\_words +- camelCaseToSeparateWords + +What you use is up to you, but **be consistent**. + +It is also possible to use the `=` operator for assignment: + +```{r} +x = 1/40 +``` + +But this is much less common among R users. The most important thing is to +**be consistent** with the operator you use. There are occasionally places +where it is less confusing to use `<-` than `=`, and it is the most common +symbol used in the community. So the recommendation is to use `<-`. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 1 + +Which of the following are valid R variable names? + +```{r, eval=FALSE} +min_height +max.height +_age +.mass +MaxLength +min-length +2widths +celsius2kelvin +``` + +::::::::::::::: solution + +## Solution to challenge 1 + +The following can be used as R variables: + +```{r ch1pt1-sol, eval=FALSE} +min_height +max.height +MaxLength +celsius2kelvin +``` + +The following creates a hidden variable: + +```{r ch1pt2-sol, eval=FALSE} +.mass +``` + +The following will not be able to be used to create a variable + +```{r ch1pt3-sol, eval=FALSE} +_age +min-length +2widths +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Vectorization + +One final thing to be aware of is that R is _vectorized_, meaning that +variables and functions can have vectors as values. In contrast to physics and +mathematics, a vector in R describes a set of values in a certain order of the +same data type. For example: + +```{r} +1:5 +2^(1:5) +x <- 1:5 +2^x +``` + +This is incredibly powerful; we will discuss this further in an +upcoming lesson. + +## Managing your environment + +There are a few useful commands you can use to interact with the R session. 
+ +`ls` will list all of the variables and functions stored in the global environment +(your working R session): + +```{r, eval=FALSE} +ls() +``` + +```{r, echo=FALSE} +# If `ls()` is left to run by itself when rendering this Rmd document (as would +# happen if the code chunk above was evaluated), the output would contain extra +# items ("args", "dest_md", "op", "src_md") that people following the lesson +# would not see in their own session. +# +# This probably comes from the way the md episodes are generated when the +# lesson website is built. The solution below uses a temporary environment to +# mimic what the learners should observe when running `ls()` on their +# machines. + +temp.env <- new.env() +temp.env$x <- x +temp.env$y <- y +ls(temp.env) +rm(temp.env) +``` + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: hidden objects + +Like in the shell, `ls` will hide any variables or functions starting +with a "." by default. To list all objects, type `ls(all.names=TRUE)` +instead. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +Note here that we didn't give any arguments to `ls`, but we still +needed to give the parentheses to tell R to call the function. + +If we type `ls` by itself, R prints a bunch of code instead of a listing of objects. + +```{r} +ls +``` + +What's going on here? + +Like everything in R, `ls` is the name of an object, and entering the name of +an object by itself prints the contents of the object. The object `x` that we +created earlier contains `r x`: + +```{r} +x +``` + +The object `ls` contains the R code that makes the `ls` function work! We'll talk +more about how functions work and start writing our own later.
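The behaviour of `ls()` described above can be tried out safely with a small sketch. It assumes a scratch environment so that nothing in your own workspace is touched; the names `scratch`, `visible_value`, and `.hidden_value` are invented for illustration and do not appear in the lesson.

```r
# A throwaway environment, so the global workspace is left alone
# (all object names here are hypothetical)
scratch <- new.env()
assign("visible_value", 1/40, envir = scratch)
assign(".hidden_value", 100, envir = scratch)

ls(scratch)                    # lists only names that do not start with "."
ls(scratch, all.names = TRUE)  # also lists names that start with "."

class(ls)                      # "function": `ls` itself is just another object
```

Using a separate environment here is only a convenience for experimenting; in an interactive session you would normally call `ls()` with no arguments, as the lesson does.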
+ +You can use `rm` to delete objects you no longer need: + +```{r, eval=FALSE} +rm(x) +``` + +If you have lots of things in your environment and want to delete all of them, +you can pass the results of `ls` to the `rm` function: + +```{r, eval=FALSE} +rm(list = ls()) +``` + +In this case we've combined the two. Like the order of operations, anything +inside the innermost parentheses is evaluated first, and so on. + +In this case we've specified that the results of `ls` should be used for the +`list` argument in `rm`. When assigning values to arguments by name, you _must_ +use the `=` operator. + +If instead we use `<-`, there will be unintended side effects, or you may get an error message: + +```{r, error=TRUE} +rm(list <- ls()) +``` + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: Warnings vs. Errors + +Pay attention when R does something unexpected! Errors, like above, +are thrown when R cannot proceed with a calculation. Warnings on the +other hand usually mean that the function has run, but it probably +hasn't worked as expected. + +In both cases, the message that R prints out usually gives you clues +about how to fix a problem. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## R Packages + +It is possible to add functions to R by writing a package, or by +obtaining a package written by someone else. As of this writing, there +are over 10,000 packages available on CRAN (the Comprehensive R Archive +Network). R and RStudio have functionality for managing packages: + +- You can see what packages are installed by typing + `installed.packages()` +- You can install packages by typing `install.packages("packagename")`, + where `packagename` is the package name, in quotes.
+- You can update installed packages by typing `update.packages()` +- You can remove a package with `remove.packages("packagename")` +- You can make a package available for use with `library(packagename)` + +Packages can also be viewed, loaded, and detached in the Packages tab of the lower right panel in RStudio. Clicking on this tab will display all of the installed packages with a checkbox next to them. If the box next to a package name is checked, the package is loaded and if it is empty, the package is not loaded. Click an empty box to load that package and click a checked box to detach that package. + +Packages can be installed and updated from the Package tab with the Install and Update buttons at the top of the tab. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 2 + +What will be the value of each variable after each +statement in the following program? + +```{r, eval=FALSE} +mass <- 47.5 +age <- 122 +mass <- mass * 2.3 +age <- age - 20 +``` + +::::::::::::::: solution + +## Solution to challenge 2 + +```{r ch2pt1-sol} +mass <- 47.5 +``` + +This will give a value of `r mass` for the variable mass + +```{r ch2pt2-sol} +age <- 122 +``` + +This will give a value of `r age` for the variable age + +```{r ch2pt3-sol} +mass <- mass * 2.3 +``` + +This will multiply the existing value of `r mass/2.3` by 2.3 to give a new value of +`r mass` to the variable mass. + +```{r ch2pt4-sol} +age <- age - 20 +``` + +This will subtract 20 from the existing value of `r age + 20 ` to give a new value +of `r age` to the variable age. + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 3 + +Run the code from the previous challenge, and write a command to +compare mass to age. Is mass larger than age? 
+ +::::::::::::::: solution + +## Solution to challenge 3 + +One way of answering this question in R is to use the `>` to set up the following: + +```{r ch3-sol} +mass > age +``` + +This should yield a boolean value of TRUE since `r mass` is greater than `r age`. + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 4 + +Clean up your working environment by deleting the mass and age +variables. + +::::::::::::::: solution + +## Solution to challenge 4 + +We can use the `rm` command to accomplish this task + +```{r ch4-sol} +rm(age, mass) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 5 + +Install the following packages: `ggplot2`, `plyr`, `gapminder` + +::::::::::::::: solution + +## Solution to challenge 5 + +We can use the `install.packages()` command to install the required packages. + +```{r ch5-sol, eval=FALSE} +install.packages("ggplot2") +install.packages("plyr") +install.packages("gapminder") +``` + +An alternate solution, to install multiple packages with a single `install.packages()` command is: + +```{r ch5-sol2, eval=FALSE} +install.packages(c("ggplot2", "plyr", "gapminder")) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::: instructor + +When installing ggplot2, it may be required for some users to use the dependencies flag as a result of lazy loading affecting the install. 
This suggestion is not tied to any known bug discussion, and is advised based on instructor feedback and experience in resolving sporadic occurrences of errors identified through delivery of this workshop: + +```{r ch5-sol3, eval=FALSE} +install.packages("ggplot2", dependencies = TRUE) +``` + +::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: keypoints + +- Use RStudio to write and run R programs. +- R has the usual arithmetic operators and mathematical functions. +- Use `<-` to assign values to variables. +- Use `ls()` to list the variables in a program. +- Use `rm()` to delete objects in a program. +- Use `install.packages()` to install packages (libraries). + +:::::::::::::::::::::::::::::::::::::::::::::::::: diff --git a/locale/es/episodes/02-project-intro.Rmd b/locale/es/episodes/02-project-intro.Rmd new file mode 100644 index 000000000..74f964f40 --- /dev/null +++ b/locale/es/episodes/02-project-intro.Rmd @@ -0,0 +1,259 @@ +--- +title: Project Management With RStudio +teaching: 20 +exercises: 10 +source: Rmd +--- + +::::::::::::::::::::::::::::::::::::::: objectives + +- Create self-contained projects in RStudio + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- How can I manage my projects in R? + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +```{r, include=FALSE} +``` + +## Introduction + +The scientific process is naturally incremental, and many projects +start life as random notes, some code, then a manuscript, and +eventually everything is a bit mixed together. + + + + +Most people tend to organize their projects like this: + +![](fig/bad_layout.png){alt='Screenshot of file manager demonstrating bad project organisation'} + +There are many reasons why we should _ALWAYS_ avoid this: + +1. It is really hard to tell which version of your data is + the original and which is the modified; +2. 
It gets really messy because it mixes files with various + extensions together; +3. It probably takes you a lot of time to actually find + things, and relate the correct figures to the exact code + that has been used to generate it; + +A good project layout will ultimately make your life easier: + +- It will help ensure the integrity of your data; +- It makes it simpler to share your code with someone else + (a lab-mate, collaborator, or supervisor); +- It allows you to easily upload your code with your manuscript submission; +- It makes it easier to pick the project back up after a break. + +## A possible solution + +Fortunately, there are tools and packages which can help you manage your work effectively. + +One of the most powerful and useful aspects of RStudio is its project management +functionality. We'll be using this today to create a self-contained, reproducible +project. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 1: Creating a self-contained project + +We're going to create a new project in RStudio: + +1. Click the "File" menu button, then "New Project". +2. Click "New Directory". +3. Click "New Project". +4. Type in the name of the directory to store your project, e.g. "my\_project". +5. If available, select the checkbox for "Create a git repository." +6. Click the "Create Project" button. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +The simplest way to open an RStudio project once it has been created is to click +through your file system to get to the directory where it was saved and double +click on the `.Rproj` file. This will open RStudio and start your R session in the +same directory as the `.Rproj` file. All your data, plots and scripts will now be +relative to the project directory. RStudio projects have the added benefit of +allowing you to open multiple projects at the same time each open to its own +project directory. This allows you to keep multiple projects open without them +interfering with each other. 
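
As a minimal sketch of what "relative to the project directory" means, the snippet below builds a path to a hypothetical file, `data/my_file.csv` (an illustrative name, not a file used in this lesson), and shows where R would look for it given the current working directory.

```r
# Where R currently resolves relative paths from; in an RStudio project
# this is the folder containing the .Rproj file.
project_dir <- getwd()

# A relative path to a hypothetical file (illustrative name only).
relative_path <- file.path("data", "my_file.csv")

# The full location R would look in when given the relative path.
absolute_path <- file.path(project_dir, relative_path)
absolute_path
```

Because the path is built from the working directory, the same script keeps working when the whole project folder is moved or shared.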
+ +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 2: Opening an RStudio project through the file system + +1. Exit RStudio. +2. Navigate to the directory where you created a project in Challenge 1. +3. Double click on the `.Rproj` file in that directory. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Best practices for project organization + +Although there is no "best" way to lay out a project, there are some general +principles to adhere to that will make project management easier: + +### Treat data as read only + +This is probably the most important goal of setting up a project. Data is +typically time consuming and/or expensive to collect. Working with it +interactively (e.g., in Excel), where it can be modified, means you are never +sure of where the data came from, or how it has been modified since collection. +It is therefore a good idea to treat your data as "read-only". + +### Data Cleaning + +In many cases your data will be "dirty": it will need significant preprocessing +to get into a format R (or any other programming language) will find useful. +This task is sometimes called "data munging". Storing the data-cleaning scripts in a +separate folder, and creating a second "read-only" data folder to hold the +"cleaned" data sets, can prevent confusion between the two sets. + +### Treat generated output as disposable + +Anything generated by your scripts should be treated as disposable: you should +be able to regenerate all of it from your scripts. + +There are lots of different ways to manage this output. Having an output folder +with different sub-directories for each separate analysis makes it easier later, +since many analyses are exploratory and don't end up being used in the final +project, and some of the analyses get shared between projects. 
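
One way to set up such a layout from the R console is sketched below. The folder names (`data`, `scripts`, `output/exploratory`) are illustrative assumptions rather than a required convention, and the example works inside a temporary directory so that it does not touch a real project.

```r
# Work in a throwaway directory so nothing in a real project is modified.
project_root <- file.path(tempdir(), "my_project")

# Hypothetical layout: raw data, scripts, and per-analysis output folders.
folders <- file.path(project_root,
                     c("data", "scripts", "output/exploratory"))

for (folder in folders) {
  # recursive = TRUE also creates any missing parent folders.
  dir.create(folder, recursive = TRUE, showWarnings = FALSE)
}

# List what was created, relative to the project root.
list.dirs(project_root, full.names = FALSE, recursive = TRUE)
```

Keeping each analysis in its own sub-directory of `output` means a single folder can be deleted and regenerated without affecting the others.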
+ +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: Good Enough Practices for Scientific Computing + +[Good Enough Practices for Scientific Computing](https://github.com/swcarpentry/good-enough-practices-in-scientific-computing/blob/gh-pages/good-enough-practices-for-scientific-computing.pdf) gives the following recommendations for project organization: + +1. Put each project in its own directory, which is named after the project. +2. Put text documents associated with the project in the `doc` directory. +3. Put raw data and metadata in the `data` directory, and files generated during cleanup and analysis in a `results` directory. +4. Put source for the project's scripts and programs in the `src` directory, and programs brought in from elsewhere or compiled locally in the `bin` directory. +5. Name all files to reflect their content or function. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +### Separate function definition and application + +One of the more effective ways to work with R is to start by writing the code you want to run directly in a .R script, and then running the selected lines (either using the keyboard shortcuts in RStudio or clicking the "Run" button) in the interactive R console. + +When your project is in its early stages, the initial .R script file usually contains many lines +of directly executed code. As it matures, reusable chunks get pulled into their +own functions. It's a good idea to keep these in two separate folders: one +to store useful functions that you'll reuse across analyses and projects, and +one to store the analysis scripts. + +### Save the data in the data directory + +Now that we have a good directory structure, we will place/save the data file in the `data/` directory. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 3 + +Download the gapminder data from [this link to a csv file](data/gapminder_data.csv). + +1. 
Download the file (right mouse click on the link above -> "Save link as" / "Save file as", or click on the link and after the page loads, press Ctrl\+S or choose File -> "Save page as") +2. Make sure it's saved under the name `gapminder_data.csv` +3. Save the file in the `data/` folder within your project. + +We will load and inspect these data later. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 4 + +It is useful to get some general idea about the dataset, directly from the +command line, before loading it into R. Understanding the dataset better +will come in handy when making decisions on how to load it in R. Use the command-line +shell to answer the following questions: + +1. What is the size of the file? +2. How many rows of data does it contain? +3. What kinds of values are stored in this file? + +::::::::::::::: solution + +## Solution to Challenge 4 + +By running these commands in the shell: + +```{r ch2a-sol, engine="sh"} +ls -lh data/gapminder_data.csv +``` + +The file size is 80K. + +```{r ch2b-sol, engine="sh"} +wc -l data/gapminder_data.csv +``` + +There are 1705 lines. The data looks like: + +```{r ch2c-sol, engine="sh"} +head data/gapminder_data.csv +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: command line in RStudio + +The Terminal tab in the console pane provides a convenient place directly +within RStudio to interact directly with the command line. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +### Working directory + +Knowing R's current working directory is important because when you need to access other files (for example, to import a data file), R will look for them relative to the current working directory. + +Each time you create a new RStudio Project, it will create a new directory for that project. 
When you open an existing `.Rproj` file, it will open that project and set R's working directory to the folder that file is in. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 5 + +You can check the current working directory with the `getwd()` command, or by using the menus in RStudio. + +1. In the console, type `getwd()` ("wd" is short for "working directory") and hit Enter. +2. In the Files pane, double click on the `data` folder to open it (or navigate to any other folder you wish). To get the Files pane back to the current working directory, click "More" and then select "Go To Working Directory". + +You can change the working directory with `setwd()`, or by using RStudio menus. + +1. In the console, type `setwd("data")` and hit Enter. Type `getwd()` and hit Enter to see the new working directory. +2. In the menus at the top of the RStudio window, click the "Session" menu button, and then select "Set Working Directory" and then "Choose Directory". Next, in the windows navigator that opens, navigate back to the project directory, and click "Open". Note that a `setwd` command will automatically appear in the console. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: File does not exist errors + +When you're attempting to reference a file in your R code and you're getting errors saying the file doesn't exist, it's a good idea to check your working directory. +You need to either provide an absolute path to the file, or you need to make sure the file is saved in the working directory (or a subfolder of the working directory) and provide a relative path. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +### Version Control + +It is important to use version control with projects. Go [here for a good lesson which describes using Git with RStudio](https://swcarpentry.github.io/git-novice/14-supplemental-rstudio.html). 
+ +:::::::::::::::::::::::::::::::::::::::: keypoints + +- Use RStudio to create and manage projects with consistent layout. +- Treat raw data as read-only. +- Treat generated output as disposable. +- Separate function definition and application. + +:::::::::::::::::::::::::::::::::::::::::::::::::: diff --git a/locale/es/episodes/03-seeking-help.Rmd b/locale/es/episodes/03-seeking-help.Rmd new file mode 100644 index 000000000..cc2e3f7b8 --- /dev/null +++ b/locale/es/episodes/03-seeking-help.Rmd @@ -0,0 +1,267 @@ +--- +title: Seeking Help +teaching: 10 +exercises: 10 +source: Rmd +--- + +::::::::::::::::::::::::::::::::::::::: objectives + +- To be able to read R help files for functions and special operators. +- To be able to use CRAN task views to identify packages to solve a problem. +- To be able to seek help from your peers. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- How can I get help in R? + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +```{r, include=FALSE} +``` + +## Reading Help Files + +R, and every package, provide help files for functions. The general syntax to search for help on any +function, "function\_name", from a specific function that is in a package loaded into your +namespace (your interactive R session) is: + +```{r, eval=FALSE} +?function_name +help(function_name) +``` + +For example take a look at the help file for `write.table()`, we will be using a similar function in an upcoming episode. + +```{r, eval=FALSE} +?write.table() +``` + +This will load up a help page in RStudio (or as plain text in R itself). + +Each help page is broken down into sections: + +- Description: An extended description of what the function does. +- Usage: The arguments of the function and their default values (which can be changed). +- Arguments: An explanation of the data each argument is expecting. +- Details: Any important details to be aware of. 
+- Value: The data the function returns. +- See Also: Any related functions you might find useful. +- Examples: Some examples for how to use the function. + +Different functions might have different sections, but these are the main ones you should be aware of. + +Notice how related functions might call for the same help file: + +```{r, eval=FALSE} +?write.table() +?write.csv() +``` + +This is because these functions have very similar applicability and often share the same arguments as inputs to the function, so package authors often choose to document them together in a single help file. + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: Running Examples + +From within the function help page, you can highlight code in the +Examples and hit Ctrl\+Return to run it in +RStudio console. This gives you a quick way to get a feel for +how a function works. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: Reading Help Files + +One of the most daunting aspects of R is the large number of functions +available. It would be prohibitive, if not impossible to remember the +correct usage for every function you use. Luckily, using the help files +means you don't have to remember that! + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Special Operators + +To seek help on special operators, use quotes or backticks: + +```{r, eval=FALSE} +?"<-" +?`<-` +``` + +## Getting Help with Packages + +Many packages come with "vignettes": tutorials and extended example documentation. +Without any arguments, `vignette()` will list all vignettes for all installed packages; +`vignette(package="package-name")` will list all available vignettes for +`package-name`, and `vignette("vignette-name")` will open the specified vignette. + +If a package doesn't have any vignettes, you can usually find help by typing +`help("package-name")`. 
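
As a small sketch of the `vignette()` calls described above, the example below lists the vignettes of a single package. It uses `grid` only because that package ships with R; whether any vignettes are actually listed depends on your installation.

```r
# List the vignettes available for a single installed package.
grid_vignettes <- vignette(package = "grid")

# The result stores one row per vignette; "Item" is the name to pass
# to vignette() and "Title" is a short description.
grid_vignettes$results[, c("Item", "Title"), drop = FALSE]

# To open one of them, pass a name taken from the "Item" column, e.g.:
# vignette("grid", package = "grid")
```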
+ +RStudio also has a set of excellent +[cheatsheets](https://rstudio.com/resources/cheatsheets/) for many packages. + +## When You Remember Part of the Function Name + +If you're not sure what package a function is in or exactly how it's spelled, you can do a fuzzy search: + +```{r, eval=FALSE} +??function_name +``` + +A fuzzy search is when you search for an approximate string match. For example, you may remember that the function +to set your working directory includes "set" in its name. You can do a fuzzy search to help you identify the function: + +```{r, eval=FALSE} +??set +``` + +## When You Have No Idea Where to Begin + +If you don't know what function or package you need to use, +[CRAN Task Views](https://cran.at.r-project.org/web/views) +is a specially maintained list of packages grouped into +fields. This can be a good starting point. + +## When Your Code Doesn't Work: Seeking Help from Your Peers + +If you're having trouble using a function, 9 times out of 10 +the question you are asking has already been answered on +[Stack Overflow](https://stackoverflow.com/). You can search using +the `[r]` tag. Please make sure to see their page on +[how to ask a good question.](https://stackoverflow.com/help/how-to-ask) + +If you can't find the answer, there are a few useful functions to +help you ask your peers: + +```{r, eval=FALSE} +?dput +``` + +`dput()` will dump the data you're working with into a format that can +be copied and pasted by others into their own R session. + +```{r} +sessionInfo() +``` + +`sessionInfo()` will print out your current version of R, as well as any packages you +have loaded. This can be useful for others to help reproduce and debug +your issue. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 1 + +Look at the help page for the `c` function. 
What kind of vector do you +expect will be created if you evaluate the following: + +```{r, eval=FALSE} +c(1, 2, 3) +c('d', 'e', 'f') +c(1, 2, 'f') +``` + +::::::::::::::: solution + +## Solution to Challenge 1 + +The `c()` function creates a vector, in which all elements are of the +same type. In the first case, the elements are numeric, in the +second, they are characters, and in the third they are also characters: +the numeric values are "coerced" to be characters. + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 2 + +Look at the help for the `paste` function. You will need to use it later. +What's the difference between the `sep` and `collapse` arguments? + +::::::::::::::: solution + +## Solution to Challenge 2 + +To look at the help for the `paste()` function, use: + +```{r, eval=FALSE} +help("paste") +?paste +``` + +The difference between `sep` and `collapse` is a little +tricky. The `paste` function accepts any number of arguments, each of which +can be a vector of any length. The `sep` argument specifies the string +used between concatenated terms — by default, a space. The result is a +vector as long as the longest argument supplied to `paste`. In contrast, +`collapse` specifies that after concatenation the elements are _collapsed_ +together using the given separator, the result being a single string. + +It is important to call the arguments explicitly by typing out the argument +name e.g `sep = ","` so the function understands to use the "," as a +separator and not a term to concatenate. +e.g. + +```{r} +paste(c("a","b"), "c") +paste(c("a","b"), "c", ",") +paste(c("a","b"), "c", sep = ",") +paste(c("a","b"), "c", collapse = "|") +paste(c("a","b"), "c", sep = ",", collapse = "|") +``` + +(For more information, +scroll to the bottom of the `?paste` help page and look at the +examples, or try `example('paste')`.) 
+ +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 3 + +Use help to find a function (and its associated parameters) that you could +use to load data from a tabular file in which columns are delimited with "\\t" +(tab) and the decimal point is a "." (period). This check for decimal +separator is important, especially if you are working with international +colleagues, because different countries have different conventions for the +decimal point (i.e. comma vs period). +Hint: use `??"read table"` to look up functions related to reading in tabular data. + +::::::::::::::: solution + +## Solution to Challenge 3 + +The standard R function for reading tab-delimited files with a period +decimal separator is read.delim(). You can also do this with +`read.table(file, sep="\t")` (the period is the _default_ decimal +separator for `read.table()`), although you may have to change +the `comment.char` argument as well if your data file contains +hash (#) characters. + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Other Resources + +- [Quick R](https://www.statmethods.net/) +- [RStudio cheat sheets](https://www.rstudio.com/resources/cheatsheets/) +- [Cookbook for R](https://www.cookbook-r.com/) + +:::::::::::::::::::::::::::::::::::::::: keypoints + +- Use `help()` to get online help in R. + +:::::::::::::::::::::::::::::::::::::::::::::::::: diff --git a/locale/es/episodes/04-data-structures-part1.Rmd b/locale/es/episodes/04-data-structures-part1.Rmd new file mode 100644 index 000000000..b11c2a52c --- /dev/null +++ b/locale/es/episodes/04-data-structures-part1.Rmd @@ -0,0 +1,1101 @@ +--- +title: Data Structures +teaching: 40 +exercises: 15 +source: Rmd +--- + +::::::::::::::::::::::::::::::::::::::: objectives + +- To be able to identify the 5 main data types. 
+- To begin exploring data frames, and understand how they are related to vectors and lists. +- To be able to ask questions from R about the type, class, and structure of an object. +- To understand the information of the attributes "names", "class", and "dim". + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- How can I read data in R? +- What are the basic data types in R? +- How do I represent categorical information in R? + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +```{r, include=FALSE} +options(stringsAsFactors = FALSE) +cats_orig <- data.frame(coat = c("calico", "black", "tabby"), weight = c(2.1, 5, 3.2), likes_catnip = c(1, 0, 1), stringsAsFactors = FALSE) +cats_bad <- data.frame(coat = c("calico", "black", "tabby", "tabby"), weight = c(2.1, 5, 3.2, "2.3 or 2.4"), likes_catnip = c(1, 0, 1, 1), stringsAsFactors = FALSE) +cats <- cats_orig +``` + +One of R's most powerful features is its ability to deal with tabular data - +such as you may already have in a spreadsheet or a CSV file. Let's start by +making a toy dataset in your `data/` directory, called `feline-data.csv`: + +```{r} +cats <- data.frame(coat = c("calico", "black", "tabby"), + weight = c(2.1, 5.0, 3.2), + likes_catnip = c(1, 0, 1)) +``` + +We can now save `cats` as a CSV file. It is good practice to call the argument +names explicitly so the function knows what default values you are changing. Here we +are setting `row.names = FALSE`. Recall you can use `?write.csv` to pull +up the help file to check out the argument names and their default values. 
+ +```{r} +write.csv(x = cats, file = "data/feline-data.csv", row.names = FALSE) +``` + +The contents of the new file, `feline-data.csv`: + +```{r, eval=FALSE} +coat,weight,likes_catnip +calico,2.1,1 +black,5.0,0 +tabby,3.2,1 +``` + +::::::::::::::::::::::::::::::::::::::::: callout + +### Tip: Editing Text files in R + +Alternatively, you can create `data/feline-data.csv` using a text editor (Nano), +or within RStudio with the **File -> New File -> Text File** menu item. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +We can load this into R via the following: + +```{r} +cats <- read.csv(file = "data/feline-data.csv") +cats +``` + +The `read.table` function is used for reading in tabular data stored in a text +file where the columns of data are separated by punctuation characters, as in +CSV files (csv = comma-separated values). Tabs and commas are the most common +punctuation characters used to separate or delimit data points in these files. +For convenience, R provides two other versions of `read.table`. These are `read.csv` +for files where the data are separated with commas and `read.delim` for files +where the data are separated with tabs. Of these three functions, `read.csv` is +the most commonly used. If needed, it is possible to override the default +delimiting punctuation marks for both `read.csv` and `read.delim`. + +::::::::::::::::::::::::::::::::::::::::: callout + +### Check your data for factors + +The default way R handles textual data has changed in recent versions. Text +data used to be interpreted by R automatically as a format called "factors". But +there is an easier format that is called "character". We will hear about +factors later, and what to use them for. For now, remember that in most cases, +they are not needed and only complicate your life, which is why newer R +versions read in text as "character". Check now if your version of R has +automatically created factors and convert them to "character" format: + +1. 
Check the data types of your input by typing `str(cats)` +2. In the output, look at the type abbreviations after the colons: If you see + only "num", "int", and "chr", you can continue with the lesson and skip this box. + If you find "Factor", continue to step 3. +3. Prevent R from automatically creating "factor" data. That can be done by + the following code: `options(stringsAsFactors = FALSE)`. Then, re-read + the cats table for the change to take effect. +4. You must set this option every time you restart R. So that you don't forget, + include it in your analysis script before you read in any data, for example + in one of the first lines. +5. In R version 4.0.0 and later, text data is no longer converted to + factors by default. So you can install this or a newer version to avoid this + problem. If you are working on an institute or company computer, ask your + administrator to do it. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +We can begin exploring our dataset right away, pulling out columns by specifying +them using the `$` operator: + +```{r} +cats$weight +cats$coat +``` + +We can do other operations on the columns: + +```{r} +## Say we discovered that the scale weighs two kg light: +cats$weight + 2 +paste("My cat is", cats$coat) +``` + +But what about + +```{r} +cats$weight + cats$coat +``` + +Understanding what happened here is key to successfully analyzing data in R. + +### Data Types + +If you guessed that the last command will return an error because `2.1` plus +`"black"` is nonsense, you're right - and you already have some intuition for an +important concept in programming called _data types_. We can ask what type of +data something is: + +```{r} +typeof(cats$weight) +``` + +There are 5 main types: `double`, `integer`, `complex`, `logical` and `character`. +For historic reasons, `double` is also called `numeric`. 
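
The relationship between `double` and `numeric` can be seen by comparing `typeof()` with `class()` and `is.numeric()`. The following is a short illustration using base R only; the variable name `x` is arbitrary.

```r
x <- 3.14

typeof(x)   # the low-level storage type: "double"
class(x)    # the name most R functions report: "numeric"

# is.numeric() is TRUE for both doubles and integers.
is.numeric(x)
is.numeric(1L)
```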
+ +```{r} +typeof(3.14) +typeof(1L) # The L suffix forces the number to be an integer, since by default R uses float numbers +typeof(1+1i) +typeof(TRUE) +typeof('banana') +``` + +No matter how +complicated our analyses become, all data in R is interpreted as one of these +basic data types. This strictness has some really important consequences. + +A user has added details of another cat. This information is in the file +`data/feline-data_v2.csv`. + +```{r, eval=FALSE} +file.show("data/feline-data_v2.csv") +``` + +```{r, eval=FALSE} +coat,weight,likes_catnip +calico,2.1,1 +black,5.0,0 +tabby,3.2,1 +tabby,2.3 or 2.4,1 +``` + +Load the new cats data like before, and check what type of data we find in the +`weight` column: + +```{r} +cats <- read.csv(file="data/feline-data_v2.csv") +typeof(cats$weight) +``` + +Oh no, our weights aren't the double type anymore! If we try to do the same math +we did on them before, we run into trouble: + +```{r} +cats$weight + 2 +``` + +What happened? +The `cats` data we are working with is something called a _data frame_. Data frames +are one of the most common and versatile types of _data structures_ we will work with in R. +A given column in a data frame cannot be composed of different data types. +In this case, R does not read everything in the data frame column `weight` as a _double_, therefore the entire +column data type changes to something that is suitable for everything in the column. + +When R reads a csv file, it reads it in as a _data frame_. Thus, when we loaded the `cats` +csv file, it is stored as a data frame. We can recognize data frames by the first row that +is written by the `str()` function: + +```{r} +str(cats) +``` + +_Data frames_ are composed of rows and columns, where each column has the +same number of rows. 
Different columns in a data frame can be made up of different +data types (this is what makes them so versatile), but everything in a given +column needs to be the same type (e.g., vector, factor, or list). + +Let's explore more about different data structures and how they behave. +For now, let's remove that extra line from our cats data and reload it, +while we investigate this behavior further: + +feline-data.csv: + +``` +coat,weight,likes_catnip +calico,2.1,1 +black,5.0,0 +tabby,3.2,1 +``` + +And back in RStudio: + +```{r, eval=FALSE} +cats <- read.csv(file="data/feline-data.csv") +``` + +```{r, include=FALSE} +cats <- cats_orig +``` + +### Vectors and Type Coercion + +To better understand this behavior, let's meet another of the data structures: +the _vector_. + +```{r} +my_vector <- vector(length = 3) +my_vector +``` + +A vector in R is essentially an ordered list of things, with the special +condition that _everything in the vector must be the same basic data type_. If +you don't choose the datatype, it'll default to `logical`; or, you can declare +an empty vector of whatever type you like. + +```{r} +another_vector <- vector(mode='character', length=3) +another_vector +``` + +You can check if something is a vector: + +```{r} +str(another_vector) +``` + +The somewhat cryptic output from this command indicates the basic data type +found in this vector - in this case `chr`, character; an indication of the +number of things in the vector - actually, the indexes of the vector, in this +case `[1:3]`; and a few examples of what's actually in the vector - in this case +empty character strings. If we similarly do + +```{r} +str(cats$weight) +``` + +we see that `cats$weight` is a vector, too - _the columns of data we load into R +data.frames are all vectors_, and that's the root of why R forces everything in +a column to be the same basic data type. 
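
The claim that data frame columns are vectors can be checked directly. The sketch below builds a small stand-in data frame (named `pets` here so that it does not overwrite the lesson's `cats` object) and inspects its columns.

```r
# A small stand-in for the cats data used in this episode.
pets <- data.frame(coat = c("calico", "black", "tabby"),
                   weight = c(2.1, 5.0, 3.2),
                   stringsAsFactors = FALSE)

# Each column, pulled out on its own, is an ordinary vector.
is.vector(pets$weight)
is.vector(pets$coat)

# Every column has exactly one basic type.
column_types <- sapply(pets, typeof)
column_types
```

Setting `stringsAsFactors = FALSE` explicitly keeps the result the same on older versions of R, where text columns would otherwise become factors.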
+ +:::::::::::::::::::::::::::::::::::::: discussion + +### Discussion 1 + +Why is R so opinionated about what we put in our columns of data? +How does this help us? + +::::::::::::::: solution + +### Discussion 1 + +By keeping everything in a column the same, we allow ourselves to make simple +assumptions about our data; if you can interpret one entry in the column as a +number, then you can interpret _all_ of them as numbers, so we don't have to +check every time. This consistency is what people mean when they talk about +_clean data_; in the long run, strict consistency goes a long way to making +our lives easier in R. + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +#### Coercion by combining vectors + +You can also make vectors with explicit contents with the combine function: + +```{r} +combine_vector <- c(2,6,3) +combine_vector +``` + +Given what we've learned so far, what do you think the following will produce? + +```{r} +quiz_vector <- c(2,6,'3') +``` + +This is something called _type coercion_, and it is the source of many surprises +and the reason why we need to be aware of the basic data types and how R will +interpret them. When R encounters a mix of types (here double and character) to +be combined into a single vector, it will force them all to be the same +type. Consider: + +```{r} +coercion_vector <- c('a', TRUE) +coercion_vector +another_coercion_vector <- c(0, TRUE) +another_coercion_vector +``` + +#### The type hierarchy + +The coercion rules go: `logical` -> `integer` -> `double` ("`numeric`") -> +`complex` -> `character`, where -> can be read as _are transformed into_. For +example, combining `logical` and `character` transforms the result to +`character`: + +```{r} +c('a', TRUE) +``` + +A quick way to recognize `character` vectors is by the quotes that enclose them +when they are printed. 
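
The hierarchy above can be walked through one step at a time. In this short sketch, each call combines a type with the next one up, and `typeof()` reports which type the resulting vector takes.

```r
# Each step combines a type with the next one up the hierarchy;
# the result always takes the "higher" type.
typeof(c(TRUE, 1L))     # logical with integer   -> "integer"
typeof(c(1L, 2.5))      # integer with double    -> "double"
typeof(c(2.5, 1+0i))    # double with complex    -> "complex"
typeof(c(1+0i, "a"))    # complex with character -> "character"

# Values are converted, not just relabelled: every element becomes text.
mixed <- c(TRUE, 2.5, "a")
mixed
```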
+ +You can try to force +coercion against this flow using the `as.` functions: + +```{r} +character_vector_example <- c('0','2','4') +character_vector_example +character_coerced_to_double <- as.double(character_vector_example) +character_coerced_to_double +double_coerced_to_logical <- as.logical(character_coerced_to_double) +double_coerced_to_logical +``` + +As you can see, some surprising things can happen when R forces one basic data +type into another! Nitty-gritty of type coercion aside, the point is: if your +data doesn't look like what you thought it was going to look like, type coercion +may well be to blame; make sure everything is the same type in your vectors and +your columns of data.frames, or you will get nasty surprises! + +But coercion can also be very useful! For example, in our `cats` data +`likes_catnip` is numeric, but we know that the 1s and 0s actually represent +`TRUE` and `FALSE` (a common way of representing them). We should use the +`logical` datatype here, which has two states: `TRUE` or `FALSE`, which is +exactly what our data represents. We can 'coerce' this column to be `logical` by +using the `as.logical` function: + +```{r} +cats$likes_catnip +cats$likes_catnip <- as.logical(cats$likes_catnip) +cats$likes_catnip +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +### Challenge 1 + +An important part of every data analysis is cleaning the input data. If you +know that the input data is all of the same format, (e.g. numbers), your +analysis is much easier! Clean the cat data set from the chapter about +type coercion. + +#### Copy the code template + +Create a new script in RStudio and copy and paste the following code. Then +move on to the tasks below, which help you to fill in the gaps (\_\_\_\_\_\_). + +``` +# Read data +cats <- read.csv("data/feline-data_v2.csv") + +# 1. Print the data +_____ + +# 2. Show an overview of the table with all data types +_____(cats) + +# 3. 
The "weight" column has the incorrect data type __________. +# The correct data type is: ____________. + +# 4. Correct the 4th weight data point with the mean of the two given values +cats$weight[4] <- 2.35 +# print the data again to see the effect +cats + +# 5. Convert the weight to the right data type +cats$weight <- ______________(cats$weight) + +# Calculate the mean to test yourself +mean(cats$weight) + +# If you see the correct mean value (and not NA), you did the exercise +# correctly! +``` + +### Instructions for the tasks + +#### 1\. Print the data + +Execute the first statement (`read.csv(...)`). Then print the data to the +console + +::::::::::::::: solution + +### Tip 1.1 + +Show the content of any variable by typing its name. + +### Solution to Challenge 1.1 + +Two correct solutions: + +``` +cats +print(cats) +``` + +::::::::::::::::::::::::: + +#### 2\. Overview of the data types + +The data type of your data is as important as the data itself. Use a +function we saw earlier to print out the data types of all columns of the +`cats` table. + +::::::::::::::: solution + +### Tip 1.2 + +In the chapter "Data types" we saw two functions that can show data types. +One printed just a single word, the data type name. The other printed +a short form of the data type, and the first few values. We need the second +here. + +::::::::::::::::::::::::: + +> ### Solution to Challenge 1.2 +> +> ``` +> str(cats) +> ``` + +#### 3\. Which data type do we need? + +The shown data type is not the right one for this data (weight of +a cat). Which data type do we need? + +- Why did the `read.csv()` function not choose the correct data type? +- Fill in the gap in the comment with the correct data type for cat weight! 
+ +::::::::::::::: solution + +### Tip 1.3 + +Scroll up to the section about the [type hierarchy](#the-type-hierarchy) +to review the available data types. + +::::::::::::::::::::::::: + +::::::::::::::: solution + +### Solution to Challenge 1.3 + +- Weight is expressed on a continuous scale (real numbers). The R + data type for this is "double" (also known as "numeric"). +- The fourth row has the value "2.3 or 2.4". That is not a single number + but two numbers and an English word. Therefore, the "character" data type + is chosen. The whole column is now text, because all values in the same + column have to be the same data type. + +::::::::::::::::::::::::: + +#### 4\. Correct the problematic value + +The code to assign a new weight value to the problematic fourth row is given. +Think first and then execute it: What will be the data type after assigning +a number like in this example? +You can check the data type after executing to see if you were right. + +::::::::::::::: solution + +### Tip 1.4 + +Revisit the hierarchy of data types when two different data types are +combined. + +::::::::::::::::::::::::: + +> ### Solution to Challenge 1.4 +> +> The data type of the column "weight" is "character". The assigned data +> type is "double". Combining two data types yields the data type that is +> higher in the following hierarchy: +> +> ``` +> logical < integer < double < complex < character +> ``` +> +> Therefore, the column is still of type character! We need to manually +> convert it to "double". + +#### 5\. Convert the column "weight" to the correct data type + +Cat weights are numbers, but the column does not have this data type yet. +Coerce the column to floating point numbers. + +::::::::::::::: solution + +### Tip 1.5 + +The functions to convert data types start with `as.`. You can look +for the function further up in the manuscript or use the RStudio +auto-complete function: Type "`as.`" and then press the TAB key. 
+ +::::::::::::::::::::::::: + +> ### Solution to Challenge 1.5 +> +> There are two functions that are synonymous for historic reasons: +> +> ``` +> cats$weight <- as.double(cats$weight) +> cats$weight <- as.numeric(cats$weight) +> ``` + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +### Some basic vector functions + +The combine function, `c()`, will also append things to an existing vector: + +```{r} +ab_vector <- c('a', 'b') +ab_vector +combine_example <- c(ab_vector, 'SWC') +combine_example +``` + +You can also make series of numbers: + +```{r} +mySeries <- 1:10 +mySeries +seq(10) +seq(1,10, by=0.1) +``` + +We can ask a few questions about vectors: + +```{r} +sequence_example <- 20:25 +head(sequence_example, n=2) +tail(sequence_example, n=4) +length(sequence_example) +typeof(sequence_example) +``` + +We can get individual elements of a vector by using the bracket notation: + +```{r} +first_element <- sequence_example[1] +first_element +``` + +To change a single element, use the bracket on the other side of the arrow: + +```{r} +sequence_example[1] <- 30 +sequence_example +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +### Challenge 2 + +Start by making a vector with the numbers 1 through 26. +Then, multiply the vector by 2. + +::::::::::::::: solution + +### Solution to Challenge 2 + +```{r} +x <- 1:26 +x <- x * 2 +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +### Lists + +Another data structure you'll want in your bag of tricks is the `list`. A list +is simpler in some ways than the other types, because you can put anything you +want in it. Remember _everything in the vector must be of the same basic data type_, +but a list can have different data types: + +```{r} +list_example <- list(1, "a", TRUE, 1+4i) +list_example +``` + +When printing the object structure with `str()`, we see the data types of all +elements: + +```{r} +str(list_example) +``` + +What is the use of lists? 
They can **organize data of different types**. For +example, you can organize different tables that belong together, similar to +spreadsheets in Excel. But there are many other uses, too. + +We will see another example that will maybe surprise you in the next chapter. + +To retrieve one of the elements of a list, use the **double bracket**: + +```{r} +list_example[[2]] +``` + +The elements of lists also can have **names**, they can be given by prepending +them to the values, separated by an equals sign: + +```{r} +another_list <- list(title = "Numbers", numbers = 1:10, data = TRUE ) +another_list +``` + +This results in a **named list**. Now we have a new function of our object! +We can access single elements by an additional way! + +```{r} +another_list$title +``` + +## Names + +With names, we can give meaning to elements. It is the first time that we do not +only have the **data**, but also explaining information. It is _metadata_ +that can be stuck to the object like a label. In R, this is called an +**attribute**. Some attributes enable us to do more with our +object, for example, like here, accessing an element by a self-defined name. + +### Accessing vectors and lists by name + +We have already seen how to generate a named list. The way to generate a named +vector is very similar. You have seen this function before: + +```{r} +pizza_price <- c( pizzasubito = 5.64, pizzafresh = 6.60, callapizza = 4.50 ) +``` + +The way to retrieve elements is different, though: + +```{r} +pizza_price["pizzasubito"] +``` + +The approach used for the list does not work: + +```{r} +pizza_price$pizzafresh +``` + +It will pay off if you remember this error message, you will meet it in your own +analyses. It means that you have just tried accessing an element like it was in +a list, but it is actually in a vector. 
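
To make the difference concrete, here is a small sketch that puts the same prices into a named vector and a named list. The name `pizza_list` is introduced only for this illustration, and the error is captured with `tryCatch()` so the code keeps running:

```r
pizza_price <- c(pizzasubito = 5.64, pizzafresh = 6.60, callapizza = 4.50)
pizza_list <- as.list(pizza_price)  # hypothetical list version of the same data

pizza_price["pizzafresh"]    # single bracket: a named vector of length one
pizza_price[["pizzafresh"]]  # double bracket: just the value
pizza_list$pizzafresh        # `$` works, because this is a list

# `$` on the atomic vector fails; keep the error message instead of stopping
result <- tryCatch(pizza_price$pizzafresh,
                   error = function(e) conditionMessage(e))
result
```

Name-based access with single or double brackets works for both vectors and lists, so it is the safer habit when you are not sure which of the two you are holding.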
+ +### Accessing and changing names + +If you are only interested in the names, use the `names()` function: + +```{r} +names(pizza_price) +``` + +We have seen how to access and change single elements of a vector. The same is +possible for names: + +```{r} +names(pizza_price)[3] +names(pizza_price)[3] <- "call-a-pizza" +pizza_price +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +### Challenge 3 + +- What is the data type of the names of `pizza_price`? You can find out + using the `str()` or `typeof()` functions. + +::::::::::::::: solution + +### Solution to Challenge 3 + +You get the names of an object by wrapping the object name inside +`names(...)`. Similarly, you get the data type of the names by again +wrapping the whole code in `typeof(...)`: + +``` +typeof(names(pizza_price)) +``` + +Alternatively, use a new variable if this is easier for you to read: + +``` +n <- names(pizza_price) +typeof(n) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +### Challenge 4 + +Instead of just changing some of the names a vector/list already has, you can +also set all names of an object by writing code like (replace ALL CAPS text): + +``` +names( OBJECT ) <- CHARACTER_VECTOR +``` + +Create a vector that gives the number for each letter in the alphabet! + +1. Generate a vector called `letter_no` with the sequence of numbers from 1 + to 26! +2. R has a built-in object called `LETTERS`. It is a character vector of + length 26, from A to Z. Set the names of the number sequence to these 26 letters. +3. Test yourself by calling `letter_no["B"]`, which should give you the number + 2! 
+ +::::::::::::::: solution + +### Solution to Challenge 4 + +``` +letter_no <- 1:26 # or seq(1,26) +names(letter_no) <- LETTERS +letter_no["B"] +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Data frames + +We met data frames at the very beginning of this lesson; they represent +a table of data. We didn't go much further into detail with our example cat +data frame: + +```{r} +cats +``` + +We can now understand something a bit surprising in our data.frame; what happens +if we run: + +```{r} +typeof(cats) +``` + +We see that data.frames look like lists 'under the hood'. Recall what we +learned about what lists can be used for: + +> Lists organize data of different types + +Columns of a data frame are vectors of different types, which are organized +by belonging to the same table. + +A data.frame is really a list of vectors. It is a special list in which all the +vectors must have the same length. + +How is this "special"-ness written into the object, so that R does not treat it +like any other list, but as a table? + +```{r} +class(cats) +``` + +A **class**, just like names, is an attribute attached to the object. It tells +us what this object means for humans. + +You might wonder: Why do we need another what-type-of-object-is-this-function? +Don't we already have `typeof()`? That function tells us how the object is +**constructed in the computer**. The `class` is the **meaning of the object for +humans**. Consequently, what `typeof()` returns is _fixed_ in R (mainly the +five data types), whereas the output of `class()` is _diverse_ and _extendable_ +by R packages. + +In our `cats` example, we have a character, a double, and a logical variable. As +we have seen already, each column of a data.frame is a vector. + +```{r} +cats$coat +cats[,1] +typeof(cats[,1]) +str(cats[,1]) +``` + +Each row is an _observation_ of different variables, itself a data.frame, and +thus can be composed of elements of different types. 
+ +```{r} +cats[1,] +typeof(cats[1,]) +str(cats[1,]) +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +### Challenge 5 + +There are several subtly different ways to call variables, observations and +elements from data.frames: + +- `cats[1]` +- `cats[[1]]` +- `cats$coat` +- `cats["coat"]` +- `cats[1, 1]` +- `cats[, 1]` +- `cats[1, ]` + +Try out these examples and explain what is returned by each one. + +_Hint:_ Use the function `typeof()` to examine what is returned in each case. + +::::::::::::::: solution + +### Solution to Challenge 5 + +```{r, eval=TRUE, echo=TRUE} +cats[1] +``` + +We can think of a data frame as a list of vectors. The single brace `[1]` +returns the first slice of the list, as another list. In this case it is the +first column of the data frame. + +```{r, eval=TRUE, echo=TRUE} +cats[[1]] +``` + +The double brace `[[1]]` returns the contents of the list item. In this case +it is the contents of the first column, a _vector_ of type _character_. + +```{r, eval=TRUE, echo=TRUE} +cats$coat +``` + +This example uses the `$` character to address items by name. _coat_ is the +first column of the data frame, again a _vector_ of type _character_. + +```{r, eval=TRUE, echo=TRUE} +cats["coat"] +``` + +Here we are using a single brace `["coat"]` replacing the index number with +the column name. Like example 1, the returned object is a _list_. + +```{r, eval=TRUE, echo=TRUE} +cats[1, 1] +``` + +This example uses a single brace, but this time we provide row and column +coordinates. The returned object is the value in row 1, column 1. The object +is a _vector_ of type _character_. + +```{r, eval=TRUE, echo=TRUE} +cats[, 1] +``` + +Like the previous example we use single braces and provide row and column +coordinates. The row coordinate is not specified, R interprets this missing +value as all the elements in this _column_ and returns them as a _vector_. 
+ +```{r, eval=TRUE, echo=TRUE} +cats[1, ] +``` + +Again we use the single brace with row and column coordinates. The column +coordinate is not specified. The return value is a _list_ containing all the +values in the first row. + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::::: callout + +### Tip: Renaming data frame columns + +Data frames have column names, which can be accessed with the `names()` function. + +```{r} +names(cats) +``` + +If you want to rename the second column of `cats`, you can assign a new name to the second element of `names(cats)`. + +```{r} +names(cats)[2] <- "weight_kg" +cats +``` + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +```{r, include=FALSE} +# reverting cats back to original version +cats <- cats_orig +``` + +### Matrices + +Last but not least is the matrix. We can declare a matrix full of zeros: + +```{r} +matrix_example <- matrix(0, ncol=6, nrow=3) +matrix_example +``` + +What makes it special is the `dim()` attribute: + +```{r} +dim(matrix_example) +``` + +And similar to other data structures, we can ask things about our matrix: + +```{r} +typeof(matrix_example) +class(matrix_example) +str(matrix_example) +nrow(matrix_example) +ncol(matrix_example) +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +### Challenge 6 + +What do you think will be the result of +`length(matrix_example)`? +Try it. +Were you right? Why / why not? + +::::::::::::::: solution + +### Solution to Challenge 6 + +What do you think will be the result of +`length(matrix_example)`? + +```{r} +matrix_example <- matrix(0, ncol=6, nrow=3) +length(matrix_example) +``` + +Because a matrix is a vector with added dimension attributes, `length` +gives you the total number of elements in the matrix. 
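
To see this for yourself, here is a rough sketch that starts from a plain vector and gives it a `dim` attribute. The name `v` is just for illustration:

```r
v <- 1:18           # a plain integer vector
is.matrix(v)        # FALSE
dim(v) <- c(3, 6)   # add dimensions: 3 rows, 6 columns
is.matrix(v)        # TRUE
length(v)           # 18, the total number of elements
```

Nothing about the stored values changes when the dimensions are added, so `length()` still counts all 18 elements.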
+ +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +### Challenge 7 + +Make another matrix, this time containing the numbers 1:50, +with 5 columns and 10 rows. +Did the `matrix` function fill your matrix by column, or by +row, as its default behaviour? +See if you can figure out how to change this. +(hint: read the documentation for `matrix`!) + +::::::::::::::: solution + +### Solution to Challenge 7 + +Make another matrix, this time containing the numbers 1:50, +with 5 columns and 10 rows. +Did the `matrix` function fill your matrix by column, or by +row, as its default behaviour? +See if you can figure out how to change this. +(hint: read the documentation for `matrix`!) + +```{r, eval=FALSE} +x <- matrix(1:50, ncol=5, nrow=10) +x <- matrix(1:50, ncol=5, nrow=10, byrow = TRUE) # to fill by row +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +### Challenge 8 + +Create a list of length two containing a character vector for each of the sections in this part of the workshop: + +- Data types +- Data structures + +Populate each character vector with the names of the data types and data +structures we've seen so far. + +::::::::::::::: solution + +### Solution to Challenge 8 + +```{r} +dataTypes <- c('double', 'complex', 'integer', 'character', 'logical') +dataStructures <- c('data.frame', 'vector', 'list', 'matrix') +answer <- list(dataTypes, dataStructures) +``` + +Note: it's nice to make a list in big writing on the board or taped to the wall +listing all of these types and structures - leave it up for the rest of the workshop +to remind people of the importance of these basics. 
+ +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +### Challenge 9 + +Consider the R output of the matrix below: + +```{r, echo=FALSE} +matrix(c(4, 1, 9, 5, 10, 7), ncol = 2, byrow = TRUE) +``` + +What was the correct command used to write this matrix? Examine +each command and try to figure out the correct one before typing them. +Think about what matrices the other commands will produce. + +1. `matrix(c(4, 1, 9, 5, 10, 7), nrow = 3)` +2. `matrix(c(4, 9, 10, 1, 5, 7), ncol = 2, byrow = TRUE)` +3. `matrix(c(4, 9, 10, 1, 5, 7), nrow = 2)` +4. `matrix(c(4, 1, 9, 5, 10, 7), ncol = 2, byrow = TRUE)` + +::::::::::::::: solution + +### Solution to Challenge 9 + +Consider the R output of the matrix below: + +```{r, echo=FALSE} +matrix(c(4, 1, 9, 5, 10, 7), ncol = 2, byrow = TRUE) +``` + +What was the correct command used to write this matrix? Examine +each command and try to figure out the correct one before typing them. +Think about what matrices the other commands will produce. + +```{r, eval=FALSE} +matrix(c(4, 1, 9, 5, 10, 7), ncol = 2, byrow = TRUE) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: keypoints + +- Use `read.csv` to read tabular data in R. +- The basic data types in R are double, integer, complex, logical, and character. +- Data structures such as data frames or matrices are built on top of lists and vectors, with some added attributes. 
+ +:::::::::::::::::::::::::::::::::::::::::::::::::: diff --git a/locale/es/episodes/05-data-structures-part2.Rmd b/locale/es/episodes/05-data-structures-part2.Rmd new file mode 100644 index 000000000..abc4d714a --- /dev/null +++ b/locale/es/episodes/05-data-structures-part2.Rmd @@ -0,0 +1,395 @@ +--- +title: Exploring Data Frames +teaching: 20 +exercises: 10 +source: Rmd +--- + +::::::::::::::::::::::::::::::::::::::: objectives + +- Add and remove rows or columns. +- Append two data frames. +- Display basic properties of data frames including size and class of the columns, names, and first few rows. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- How can I manipulate a data frame? + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +```{r, include=FALSE} +``` + +At this point, you've seen it all: in the last lesson, we toured all the basic +data types and data structures in R. Everything you do will be a manipulation of +those tools. But most of the time, the star of the show is the data frame—the table that we created by loading information from a csv file. In this lesson, we'll learn a few more things +about working with data frames. + +## Adding columns and rows in data frames + +We already learned that the columns of a data frame are vectors, so that our +data are consistent in type throughout the columns. As such, if we want to add a +new column, we can start by making a new vector: + +```{r, echo=FALSE} +cats <- read.csv("data/feline-data.csv") +``` + +```{r} +age <- c(2, 3, 5) +cats +``` + +We can then add this as a column via: + +```{r} +cbind(cats, age) +``` + +Note that if we tried to add a vector of ages with a different number of entries than the number of rows in the data frame, it would fail: + +```{r, error=TRUE} +age <- c(2, 3, 5, 12) +cbind(cats, age) + +age <- c(2, 3) +cbind(cats, age) +``` + +Why didn't this work? 
Of course, R wants to see one element in our new column +for every row in the table: + +```{r} +nrow(cats) +length(age) +``` + +So for it to work, `nrow(cats)` must equal `length(age)`. Let's overwrite the content of cats with our new data frame. + +```{r} +age <- c(2, 3, 5) +cats <- cbind(cats, age) +``` + +Now how about adding rows? We already know that the rows of a +data frame are lists: + +```{r} +newRow <- list("tortoiseshell", 3.3, TRUE, 9) +cats <- rbind(cats, newRow) +``` + +Let's confirm that our new row was added correctly. + +```{r} +cats +``` + +## Removing rows + +We now know how to add rows and columns to our data frame in R. Now let's learn to remove rows. + +```{r} +cats +``` + +We can ask for a data frame minus the last row: + +```{r} +cats[-4, ] +``` + +Notice the comma with nothing after it to indicate that we want to drop the entire fourth row. + +Note: we could also remove several rows at once by putting the row numbers +inside of a vector, for example: `cats[c(-3,-4), ]` + +## Removing columns + +We can also remove columns in our data frame. What if we want to remove the column "age"? We can remove it in two ways: by column number or by column name. + +```{r} +cats[,-4] +``` + +Notice the comma with nothing before it, indicating we want to keep all of the rows. + +Alternatively, we can drop the column by using the column name and the `%in%` operator. The `%in%` operator goes through each element of its left argument, in this case the names of `cats`, and asks, "Does this element occur in the second argument?" + +```{r} +drop <- names(cats) %in% c("age") +cats[,!drop] +``` + +We will cover subsetting with logical operators like `%in%` in more detail in the next episode. 
See the section [Subsetting through other logical operations](06-data-subsetting.Rmd). + +## Appending to a data frame + +The key to remember when adding data to a data frame is that _columns are +vectors and rows are lists._ We can also glue two data frames +together with `rbind`: + +```{r} +cats <- rbind(cats, cats) +cats +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 1 + +You can create a new data frame right from within R with the following syntax: + +```{r} +df <- data.frame(id = c("a", "b", "c"), + x = 1:3, + y = c(TRUE, TRUE, FALSE)) +``` + +Make a data frame that holds the following information for yourself: + +- first name +- last name +- lucky number + +Then use `rbind` to add an entry for the people sitting beside you. +Finally, use `cbind` to add a column with each person's answer to the question, "Is it time for coffee break?" + +::::::::::::::: solution + +## Solution to Challenge 1 + +```{r} +df <- data.frame(first = c("Grace"), + last = c("Hopper"), + lucky_number = c(0)) +df <- rbind(df, list("Marie", "Curie", 238) ) +df <- cbind(df, coffeetime = c(TRUE,TRUE)) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Realistic example + +So far, you have seen the basics of manipulating data frames with our cat data; +now let's use those skills to digest a more realistic dataset. Let's read in the +`gapminder` dataset that we downloaded previously: + +```{r} +gapminder <- read.csv("data/gapminder_data.csv") +``` + +::::::::::::::::::::::::::::::::::::::::: callout + +## Miscellaneous Tips + +- Another type of file you might encounter is the tab-separated values file (.tsv). To specify a tab as the separator, pass `sep = "\t"` to `read.csv()`, or use `read.delim()`. + +- Files can also be downloaded directly from the Internet into a local + folder of your choice onto your computer using the `download.file` function. 
+ The `read.csv` function can then be executed to read the downloaded file from the download location, for example, + +```{r, eval=FALSE, echo=TRUE} +download.file("https://raw.githubusercontent.com/swcarpentry/r-novice-gapminder/main/episodes/data/gapminder_data.csv", destfile = "data/gapminder_data.csv") +gapminder <- read.csv("data/gapminder_data.csv") +``` + +- Alternatively, you can also read in files directly into R from the Internet by replacing the file paths with a web address in `read.csv`. One should note that in doing this no local copy of the csv file is first saved onto your computer. For example, + +```{r, eval=FALSE, echo=TRUE} +gapminder <- read.csv("https://raw.githubusercontent.com/swcarpentry/r-novice-gapminder/main/episodes/data/gapminder_data.csv") +``` + +- You can read directly from excel spreadsheets without + converting them to plain text first by using the [readxl](https://cran.r-project.org/package=readxl) package. + +- The argument "stringsAsFactors" can be useful to tell R how to read strings either as factors or as character strings. In R versions after 4.0, all strings are read-in as characters by default, but in earlier versions of R, strings are read-in as factors by default. For more information, see the call-out in [the previous episode](https://swcarpentry.github.io/r-novice-gapminder/04-data-structures-part1.html#check-your-data-for-factors). + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +Let's investigate gapminder a bit; the first thing we should always do is check +out what the data looks like with `str`: + +```{r} +str(gapminder) +``` + +An additional method for examining the structure of gapminder is to use the `summary` function. This function can be used on various objects in R. For data frames, `summary` yields a numeric, tabular, or descriptive summary of each column. 
Numeric or integer columns are described by descriptive statistics (quartiles and mean), and character columns by their length, class, and mode. + +```{r} +summary(gapminder) +``` + +Along with the `str` and `summary` functions, we can examine individual columns of the data frame with our `typeof` function: + +```{r} +typeof(gapminder$year) +typeof(gapminder$country) +str(gapminder$country) +``` + +We can also interrogate the data frame for information about its dimensions; +remembering that `str(gapminder)` said there were 1704 observations of 6 +variables in gapminder, what do you think the following will produce, and why? + +```{r} +length(gapminder) +``` + +A fair guess would have been to say that the length of a data frame would be the +number of rows it has (1704), but this is not the case; remember, a data frame +is a _list of vectors and factors_: + +```{r} +typeof(gapminder) +``` + +`length` gave us 6 because gapminder is built out of a list of 6 +columns. To get the number of rows and columns in our dataset, try: + +```{r} +nrow(gapminder) +ncol(gapminder) +``` + +Or, both at once: + +```{r} +dim(gapminder) +``` + +We'll also likely want to know what the titles of all the columns are, so we can +ask for them later: + +```{r} +colnames(gapminder) +``` + +At this stage, it's important to ask ourselves if the structure R is reporting +matches our intuition or expectations; do the basic data types reported for each +column make sense? If not, we need to sort any problems out now before they turn +into bad surprises down the road, using what we've learned about how R +interprets data, and the importance of _strict consistency_ in how we record our +data. + +Once we're happy that the data types and structures seem reasonable, it's time +to start digging into our data proper. 
Check out the first few lines: + +```{r} +head(gapminder) +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 2 + +It's good practice to also check the last few lines of your data and some in the middle. How would you do this? + +Searching for ones specifically in the middle isn't too hard, but we could ask for a few lines at random. How would you code this? + +::::::::::::::: solution + +## Solution to Challenge 2 + +To check the last few lines it's relatively simple as R already has a function for this: + +```r +tail(gapminder) +tail(gapminder, n = 15) +``` + +What about a few arbitrary rows just in case something is odd in the middle? + +## Tip: There are several ways to achieve this. + +The solution here presents one form of using nested functions, i.e. a function passed as an argument to another function. This might sound like a new concept, but you are already using it! +Remember my\_dataframe[rows, cols] will print to screen your data frame with the number of rows and columns you asked for (although you might have asked for a range or named columns for example). How would you get the last row if you don't know how many rows your data frame has? R has a function for this. What about getting a (pseudorandom) sample? R also has a function for this. + +```r +gapminder[sample(nrow(gapminder), 5), ] +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +To make sure our analysis is reproducible, we should put the code +into a script file so we can come back to it later. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 3 + +Go to file -> new file -> R script, and write an R script +to load in the gapminder dataset. Put it in the `scripts/` +directory and add it to version control. + +Run the script using the `source` function, using the file path +as its argument (or by pressing the "source" button in RStudio). 
+ +::::::::::::::: solution + +## Solution to Challenge 3 + +The `source` function can be used to use a script within a script. +Assume you would like to load the same type of file over and over +again and therefore you need to specify the arguments to fit the +needs of your file. Instead of writing the necessary argument again +and again you could just write it once and save it as a script. Then, +you can use `source("Your_Script_containing_the_load_function")` in a new +script to use the function of that script without writing everything again. +Check out `?source` to find out more. + +```{r, eval=FALSE} +download.file("https://raw.githubusercontent.com/swcarpentry/r-novice-gapminder/main/episodes/data/gapminder_data.csv", destfile = "data/gapminder_data.csv") +gapminder <- read.csv(file = "data/gapminder_data.csv") +``` + +To run the script and load the data into the `gapminder` variable: + +```{r, eval=FALSE} +source(file = "scripts/load-gapminder.R") +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 4 + +Read the output of `str(gapminder)` again; +this time, use what you've learned about lists and vectors, +as well as the output of functions like `colnames` and `dim` +to explain what everything that `str` prints out for gapminder means. +If there are any parts you can't interpret, discuss with your neighbors! + +::::::::::::::: solution + +## Solution to Challenge 4 + +The object `gapminder` is a data frame with columns + +- `country` and `continent` are character strings. +- `year` is an integer vector. +- `pop`, `lifeExp`, and `gdpPercap` are numeric vectors. + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: keypoints + +- Use `cbind()` to add a new column to a data frame. +- Use `rbind()` to add a new row to a data frame. +- Remove rows from a data frame. 
+- Use `str()`, `summary()`, `nrow()`, `ncol()`, `dim()`, `colnames()`, `head()`, and `typeof()` to understand the structure of a data frame. +- Read in a csv file using `read.csv()`. +- Understand what `length()` of a data frame represents. + +:::::::::::::::::::::::::::::::::::::::::::::::::: diff --git a/locale/es/episodes/06-data-subsetting.Rmd b/locale/es/episodes/06-data-subsetting.Rmd new file mode 100644 index 000000000..23242457e --- /dev/null +++ b/locale/es/episodes/06-data-subsetting.Rmd @@ -0,0 +1,863 @@ +--- +title: Subsetting Data +teaching: 35 +exercises: 15 +source: Rmd +--- + +::::::::::::::::::::::::::::::::::::::: objectives + +- To be able to subset vectors, factors, matrices, lists, and data frames +- To be able to extract individual and multiple elements: by index, by name, using comparison operations +- To be able to skip and remove elements from various data structures. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- How can I work with subsets of data in R? + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +```{r, include=FALSE} +gapminder <- read.csv("data/gapminder_data.csv", header = TRUE) +``` + +R has many powerful subset operators. Mastering them will allow you to +easily perform complex operations on any kind of dataset. + +There are six different ways we can subset any kind of object, and three +different subsetting operators for the different data structures. + +Let's start with the workhorse of R: a simple numeric vector. + +```{r} +x <- c(5.4, 6.2, 7.1, 4.8, 7.5) +names(x) <- c('a', 'b', 'c', 'd', 'e') +x +``` + +::::::::::::::::::::::::::::::::::::::::: callout + +## Atomic vectors + +In R, simple vectors containing character strings, numbers, or logical values are called _atomic_ vectors because they can't be further simplified. 
+ +:::::::::::::::::::::::::::::::::::::::::::::::::: + +So now that we've created a dummy vector to play with, how do we get at its +contents? + +## Accessing elements using their indices + +To extract elements of a vector we can give their corresponding index, starting +from one: + +```{r} +x[1] +``` + +```{r} +x[4] +``` + +It may look different, but the square brackets operator is a function. For vectors +(and matrices), it means "get me the nth element". + +We can ask for multiple elements at once: + +```{r} +x[c(1, 3)] +``` + +Or slices of the vector: + +```{r} +x[1:4] +``` + +the `:` operator creates a sequence of numbers from the left element to the right. + +```{r} +1:4 +c(1, 2, 3, 4) +``` + +We can ask for the same element multiple times: + +```{r} +x[c(1,1,3)] +``` + +If we ask for an index beyond the length of the vector, R will return a missing value: + +```{r} +x[6] +``` + +This is a vector of length one containing an `NA`, whose name is also `NA`. + +If we ask for the 0th element, we get an empty vector: + +```{r} +x[0] +``` + +::::::::::::::::::::::::::::::::::::::::: callout + +## Vector numbering in R starts at 1 + +In many programming languages (C and Python, for example), the first +element of a vector has an index of 0. In R, the first element is 1. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Skipping and removing elements + +If we use a negative number as the index of a vector, R will return +every element _except_ for the one specified: + +```{r} +x[-2] +``` + +We can skip multiple elements: + +```{r} +x[c(-1, -5)] # or x[-c(1,5)] +``` + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: Order of operations + +A common trip up for novices occurs when trying to skip +slices of a vector. It's natural to try to negate a +sequence like so: + +```{r, error=TRUE, eval=FALSE} +x[-1:3] +``` + +This gives a somewhat cryptic error: + +```{r, error=TRUE, echo=FALSE} +x[-1:3] +``` + +But remember the order of operations. 
`:` is really a function. +It takes its first argument as -1, and its second as 3, +so generates the sequence of numbers: `c(-1, 0, 1, 2, 3)`. + +The correct solution is to wrap that function call in brackets, so +that the `-` operator applies to the result: + +```{r} +x[-(1:3)] +``` + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +To remove elements from a vector, we need to assign the result back +into the variable: + +```{r} +x <- x[-4] +x +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 1 + +Given the following code: + +```{r} +x <- c(5.4, 6.2, 7.1, 4.8, 7.5) +names(x) <- c('a', 'b', 'c', 'd', 'e') +print(x) +``` + +Come up with at least 2 different commands that will produce the following output: + +```{r, echo=FALSE} +x[2:4] +``` + +After you find 2 different commands, compare notes with your neighbour. Did you have different strategies? + +::::::::::::::: solution + +## Solution to challenge 1 + +```{r} +x[2:4] +``` + +```{r} +x[-c(1,5)] +``` + +```{r} +x[c(2,3,4)] +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Subsetting by name + +We can extract elements by using their name, instead of extracting by index: + +```{r} +x <- c(a=5.4, b=6.2, c=7.1, d=4.8, e=7.5) # we can name a vector 'on the fly' +x[c("a", "c")] +``` + +This is usually a much more reliable way to subset objects: the +position of various elements can often change when chaining together +subsetting operations, but the names will always remain the same! + +## Subsetting through other logical operations {#logical-operations} + +We can also use any logical vector to subset: + +```{r} +x[c(FALSE, FALSE, TRUE, FALSE, TRUE)] +``` + +Since comparison operators (e.g. `>`, `<`, `==`) evaluate to logical vectors, we can also +use them to succinctly subset vectors: the following statement gives +the same result as the previous one. 
+ +```{r} +x[x > 7] +``` + +Breaking it down, this statement first evaluates `x>7`, generating +a logical vector `c(FALSE, FALSE, TRUE, FALSE, TRUE)`, and then +selects the elements of `x` corresponding to the `TRUE` values. + +We can use `==` to mimic the previous method of indexing by name +(remember you have to use `==` rather than `=` for comparisons): + +```{r} +x[names(x) == "a"] +``` + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: Combining logical conditions + +We often want to combine multiple logical +criteria. For example, we might want to find all the countries that are +located in Asia **or** Europe **and** have life expectancies within a certain +range. Several operations for combining logical vectors exist in R: + +- `&`, the "logical AND" operator: returns `TRUE` if both the left and right + are `TRUE`. +- `|`, the "logical OR" operator: returns `TRUE` if either the left or right + (or both) are `TRUE`. + +You may sometimes see `&&` and `||` instead of `&` and `|`. These two-character operators +only look at the first element of each vector and ignore the +remaining elements (in R 4.3.0 and later, they raise an error if either side has more than one element). In general you should not use the two-character +operators in data analysis; save them +for programming, i.e. deciding whether to execute a statement. + +- `!`, the "logical NOT" operator: converts `TRUE` to `FALSE` and `FALSE` to + `TRUE`. It can negate a single logical condition (e.g. `!TRUE` becomes + `FALSE`), or a whole vector of conditions (e.g. `!c(TRUE, FALSE)` becomes + `c(FALSE, TRUE)`). + +Additionally, you can compare the elements within a single vector using the +`all` function (which returns `TRUE` if every element of the vector is `TRUE`) +and the `any` function (which returns `TRUE` if one or more elements of the +vector are `TRUE`).
+ +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 2 + +Given the following code: + +```{r} +x <- c(5.4, 6.2, 7.1, 4.8, 7.5) +names(x) <- c('a', 'b', 'c', 'd', 'e') +print(x) +``` + +Write a subsetting command to return the values in x that are greater than 4 and less than 7. + +::::::::::::::: solution + +## Solution to challenge 2 + +```{r} +x_subset <- x[x<7 & x>4] +print(x_subset) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: Non-unique names + +You should be aware that it is possible for multiple elements in a +vector to have the same name. (For a data frame, columns can have +the same name --- although R tries to avoid this --- but row names +must be unique.) Consider these examples: + +```{r} +x <- 1:3 +x +names(x) <- c('a', 'a', 'a') +x +x['a'] # only returns first value +x[names(x) == 'a'] # returns all three values +``` + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: Getting help for operators + +Remember you can search for help on operators by wrapping them in quotes: +`help("%in%")` or `?"%in%"`. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Skipping named elements + +Skipping or removing named elements is a little harder. If we try to skip one named element by negating the string, R complains (slightly obscurely) that it doesn't know how to take the negative of a string: + +```{r} +x <- c(a=5.4, b=6.2, c=7.1, d=4.8, e=7.5) # we start again by naming a vector 'on the fly' +x[-"a"] +``` + +However, we can use the `!=` (not-equals) operator to construct a logical vector that will do what we want: + +```{r} +x[names(x) != "a"] +``` + +Skipping multiple named indices is a little bit harder still. 
Suppose we want to drop the `"a"` and `"c"` elements, so we try this: + +```{r} +x[names(x)!=c("a","c")] +``` + +R did _something_, but it gave us a warning that we ought to pay attention to - and it apparently _gave us the wrong answer_ (the `"c"` element is still included in the vector)! + +So what does `!=` actually do in this case? That's an excellent question. + +### Recycling + +Let's take a look at the comparison component of this code: + +```{r} +names(x) != c("a", "c") +``` + +Why does R give `TRUE` as the third element of this vector, when `names(x)[3] != "c"` is obviously false? +When you use `!=`, R tries to compare each element +of the left argument with the corresponding element of its right +argument. What happens when you compare vectors of different lengths? + +![](fig/06-rmd-inequality.1.png){alt='Inequality testing'} + +When one vector is shorter than the other, it gets _recycled_: + +![](fig/06-rmd-inequality.2.png){alt='Inequality testing: results of recycling'} + +In this case R **repeats** `c("a", "c")` as many times as necessary to match `names(x)`, i.e. we get `c("a","c","a","c","a")`. Since the recycled `"a"` +doesn't match the third element of `names(x)`, the value of `!=` is `TRUE`. +Because in this case the longer vector length (5) isn't a multiple of the shorter vector length (2), R printed a warning message. If we had been unlucky and `names(x)` had contained six elements, R would _silently_ have done the wrong thing (i.e., not what we intended it to do). This recycling rule can introduce hard-to-find and subtle bugs! + +The way to get R to do what we really want (match _each_ element of the left argument with _all_ of the elements of the right argument) is to use the `%in%` operator. The `%in%` operator goes through each element of its left argument, in this case the names of `x`, and asks, "Does this element occur in the second argument?"
Here, since we want to _exclude_ values, we also need a `!` operator to change "in" to "not in": + +```{r} +x[! names(x) %in% c("a","c") ] +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 3 + +Selecting elements of a vector that match any of a list of components +is a very common data analysis task. For example, the gapminder data set +contains `country` and `continent` variables, but no information between +these two scales. Suppose we want to pull out information from southeast +Asia: how do we set up an operation to produce a logical vector that +is `TRUE` for all of the countries in southeast Asia and `FALSE` otherwise? + +Suppose you have these data: + +```{r} +seAsia <- c("Myanmar","Thailand","Cambodia","Vietnam","Laos") +## read in the gapminder data that we downloaded in episode 2 +gapminder <- read.csv("data/gapminder_data.csv", header=TRUE) +## extract the `country` column from a data frame (we'll see this later); +## convert from a factor to a character; +## and get just the non-repeated elements +countries <- unique(as.character(gapminder$country)) +``` + +There's a wrong way (using only `==`), which will give you a warning; +a clunky way (using the logical operators `==` and `|`); and +an elegant way (using `%in%`). See whether you can come up with all three +and explain how they (don't) work. + +::::::::::::::: solution + +## Solution to challenge 3 + +- The **wrong** way to do this problem is `countries==seAsia`. This + gives a warning (`"In countries == seAsia : longer object length is not a multiple of shorter object length"`) and the wrong answer (a vector of all + `FALSE` values), because none of the recycled values of `seAsia` happen + to line up correctly with matching values in `country`. 
+- The **clunky** (but technically correct) way to do this problem is + +```{r, results="hide"} + (countries=="Myanmar" | countries=="Thailand" | + countries=="Cambodia" | countries == "Vietnam" | countries=="Laos") +``` + +(or `countries==seAsia[1] | countries==seAsia[2] | ...`). This +gives the correct values, but hopefully you can see how awkward it +is (what if we wanted to select countries from a much longer list?). + +- The best way to do this problem is `countries %in% seAsia`, which + is both correct and easy to type (and read). + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Handling special values + +At some point you will encounter functions in R that cannot handle missing, infinite, +or undefined data. + +There are a number of special functions you can use to filter out this data: + +- `is.na` will return all positions in a vector, matrix, or data.frame + containing `NA` (or `NaN`) +- likewise, `is.nan`, and `is.infinite` will do the same for `NaN` and `Inf`. +- `is.finite` will return all positions in a vector, matrix, or data.frame + that do not contain `NA`, `NaN` or `Inf`. +- `na.omit` will filter out all missing values from a vector + +## Factor subsetting + +Now that we've explored the different ways to subset vectors, how +do we subset the other data structures? + +Factor subsetting works the same way as vector subsetting. + +```{r} +f <- factor(c("a", "a", "b", "c", "c", "d")) +f[f == "a"] +f[f %in% c("b", "c")] +f[1:3] +``` + +Skipping elements will not remove the level +even if no more of that category exists in the factor: + +```{r} +f[-3] +``` + +## Matrix subsetting + +Matrices are also subsetted using the `[` function. 
In this case +it takes two arguments: the first applying to the rows, the second +to the columns: + +```{r} +set.seed(1) +m <- matrix(rnorm(6*4), ncol=4, nrow=6) +m[3:4, c(3,1)] +``` + +You can leave the first or second arguments blank to retrieve all the +rows or columns respectively: + +```{r} +m[, c(3,4)] +``` + +If we only access one row or column, R will automatically convert the result +to a vector: + +```{r} +m[3,] +``` + +If you want to keep the output as a matrix, you need to specify a _third_ argument, +`drop = FALSE`: + +```{r} +m[3, , drop=FALSE] +``` + +Unlike vectors, if we try to access a row or column outside of the matrix, +R will throw an error: + +```{r, error=TRUE} +m[, c(3,6)] +``` + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: Higher dimensional arrays + +When dealing with multi-dimensional arrays, each argument to `[` +corresponds to a dimension. For example, in a 3D array, the first three +arguments correspond to the rows, columns, and depth dimensions. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +Because matrices are vectors, we can +also subset using only one argument: + +```{r} +m[5] +``` + +This usually isn't useful, and is often confusing to read. However, it is useful to note that matrices +are laid out in _column-major format_ by default. That is, the elements of the +vector are arranged column-wise: + +```{r} +matrix(1:6, nrow=2, ncol=3) +``` + +If you wish to populate the matrix by row, use `byrow=TRUE`: + +```{r} +matrix(1:6, nrow=2, ncol=3, byrow=TRUE) +``` + +Matrices can also be subsetted using their row names and column names +instead of their row and column indices. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 4 + +Given the following code: + +```{r} +m <- matrix(1:18, nrow=3, ncol=6) +print(m) +``` + +1. Which of the following commands will extract the values 11 and 14? + +A. `m[2,4,2,5]` + +B. `m[2:5]` + +C. `m[4:5,2]` + +D.
`m[2,c(4,5)]` + +::::::::::::::: solution + +## Solution to challenge 4 + +D + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## List subsetting + +Now we'll introduce some new subsetting operators. There are three functions +used to subset lists. We've already seen these when learning about atomic vectors and matrices: `[`, `[[`, and `$`. + +Using `[` will always return a list. If you want to _subset_ a list, but not +_extract_ an element, then you will likely use `[`. + +```{r} +xlist <- list(a = "Software Carpentry", b = 1:10, data = head(mtcars)) +xlist[1] +``` + +This returns a _list with one element_. + +We can subset elements of a list exactly the same way as atomic +vectors using `[`. Comparison operations however won't work as +they're not recursive, they will try to condition on the data structures +in each element of the list, not the individual elements within those +data structures. + +```{r} +xlist[1:2] +``` + +To extract individual elements of a list, you need to use the double-square +bracket function: `[[`. + +```{r} +xlist[[1]] +``` + +Notice that now the result is a vector, not a list. + +You can't extract more than one element at once: + +```{r, error=TRUE} +xlist[[1:2]] +``` + +Nor use it to skip elements: + +```{r, error=TRUE} +xlist[[-1]] +``` + +But you can use names to both subset and extract elements: + +```{r} +xlist[["a"]] +``` + +The `$` function is a shorthand way for extracting elements by name: + +```{r} +xlist$data +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 5 + +Given the following list: + +```{r, eval=FALSE} +xlist <- list(a = "Software Carpentry", b = 1:10, data = head(mtcars)) +``` + +Using your knowledge of both list and vector subsetting, extract the number 2 from xlist. +Hint: the number 2 is contained within the "b" item in the list. 
+ +::::::::::::::: solution + +## Solution to challenge 5 + +```{r} +xlist$b[2] +``` + +```{r} +xlist[[2]][2] +``` + +```{r} +xlist[["b"]][2] +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 6 + +Given a linear model: + +```{r, eval=FALSE} +mod <- aov(pop ~ lifeExp, data=gapminder) +``` + +Extract the residual degrees of freedom (hint: `attributes()` will help you) + +::::::::::::::: solution + +## Solution to challenge 6 + +```{r, eval=FALSE} +attributes(mod) ## `df.residual` is one of the names of `mod` +``` + +```{r, eval=FALSE} +mod$df.residual +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Data frames + +Remember the data frames are lists underneath the hood, so similar rules +apply. However they are also two dimensional objects: + +`[` with one argument will act the same way as for lists, where each list +element corresponds to a column. The resulting object will be a data frame: + +```{r} +head(gapminder[3]) +``` + +Similarly, `[[` will act to extract _a single column_: + +```{r} +head(gapminder[["lifeExp"]]) +``` + +And `$` provides a convenient shorthand to extract columns by name: + +```{r} +head(gapminder$year) +``` + +With two arguments, `[` behaves the same way as for matrices: + +```{r} +gapminder[1:3,] +``` + +If we subset a single row, the result will be a data frame (because +the elements are mixed types): + +```{r} +gapminder[3,] +``` + +But for a single column the result will be a vector (this can +be changed with the third argument, `drop = FALSE`). + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 7 + +Fix each of the following common data frame subsetting errors: + +1. Extract observations collected for the year 1957 + +```{r, eval=FALSE} +gapminder[gapminder$year = 1957,] +``` + +2. 
Extract all columns except 1 through to 4 + +```{r, eval=FALSE} +gapminder[,-1:4] +``` + +3. Extract the rows where the life expectancy is longer than 80 years + +```{r, eval=FALSE} +gapminder[gapminder$lifeExp > 80] +``` + +4. Extract the first row, and the fourth and fifth columns + (`continent` and `lifeExp`). + +```{r, eval=FALSE} +gapminder[1, 4, 5] +``` + +5. Advanced: extract rows that contain information for the years 2002 + and 2007 + +```{r, eval=FALSE} +gapminder[gapminder$year == 2002 | 2007,] +``` + +::::::::::::::: solution + +## Solution to challenge 7 + +Fix each of the following common data frame subsetting errors: + +1. Extract observations collected for the year 1957 + +```{r, eval=FALSE} +# gapminder[gapminder$year = 1957,] +gapminder[gapminder$year == 1957,] +``` + +2. Extract all columns except 1 through to 4 + +```{r, eval=FALSE} +# gapminder[,-1:4] +gapminder[,-c(1:4)] +``` + +3. Extract the rows where the life expectancy is longer than 80 years + +```{r, eval=FALSE} +# gapminder[gapminder$lifeExp > 80] +gapminder[gapminder$lifeExp > 80,] +``` + +4. Extract the first row, and the fourth and fifth columns + (`continent` and `lifeExp`). + +```{r, eval=FALSE} +# gapminder[1, 4, 5] +gapminder[1, c(4, 5)] +``` + +5. Advanced: extract rows that contain information for the years 2002 + and 2007 + +```{r, eval=FALSE} +# gapminder[gapminder$year == 2002 | 2007,] +gapminder[gapminder$year == 2002 | gapminder$year == 2007,] +gapminder[gapminder$year %in% c(2002, 2007),] +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 8 + +1. Why does `gapminder[1:20]` return an error? How does it differ from `gapminder[1:20, ]`? + +2. Create a new `data.frame` called `gapminder_small` that only contains rows 1 through 9 + and 19 through 23. You can do this in one or two steps. + +::::::::::::::: solution + +## Solution to challenge 8 + +1.
`gapminder` is a data.frame so needs to be subsetted on two dimensions. `gapminder[1:20, ]` subsets the data to give the first 20 rows and all columns. + +2. + +```{r} +gapminder_small <- gapminder[c(1:9, 19:23),] +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: keypoints + +- Indexing in R starts at 1, not 0. +- Access individual values by location using `[]`. +- Access slices of data using `[low:high]`. +- Access arbitrary sets of data using `[c(...)]`. +- Use logical operations and logical vectors to access subsets of data. + +:::::::::::::::::::::::::::::::::::::::::::::::::: diff --git a/locale/es/episodes/07-control-flow.Rmd b/locale/es/episodes/07-control-flow.Rmd new file mode 100644 index 000000000..39946a2c4 --- /dev/null +++ b/locale/es/episodes/07-control-flow.Rmd @@ -0,0 +1,565 @@ +--- +title: Control Flow +teaching: 45 +exercises: 20 +source: Rmd +--- + +::::::::::::::::::::::::::::::::::::::: objectives + +- Write conditional statements with `if...else` statements and `ifelse()`. +- Write and understand `for()` loops. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- How can I make data-dependent choices in R? +- How can I repeat operations in R? + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +```{r, include=FALSE} +gapminder <- read.csv("data/gapminder_data.csv", header = TRUE) +set.seed(10) +``` + +Often when we're coding we want to control the flow of our actions. This can be done +by setting actions to occur only if a condition or a set of conditions are met. +Alternatively, we can also set an action to occur a particular number of times. + +There are several ways you can control flow in R. +For conditional statements, the most commonly used approaches are the constructs: + +```{r, eval=FALSE} +# if +if (condition is true) { + perform action +} + +# if ... 
else +if (condition is true) { + perform action +} else { # that is, if the condition is false, + perform alternative action +} +``` + +Say, for example, that we want R to print a message if a variable `x` has a particular value: + +```{r} +x <- 8 + +if (x >= 10) { + print("x is greater than or equal to 10") +} + +x +``` + +The print statement does not appear in the console because `x` is not greater than or equal to 10. To print a different message for numbers less than 10, we can add an `else` statement. + +```{r} +x <- 8 + +if (x >= 10) { + print("x is greater than or equal to 10") +} else { + print("x is less than 10") +} +``` + +You can also test multiple conditions by using `else if`. + +```{r} +x <- 8 + +if (x >= 10) { + print("x is greater than or equal to 10") +} else if (x > 5) { + print("x is greater than 5, but less than 10") +} else { + print("x is less than or equal to 5") +} +``` + +**Important:** when R evaluates the condition inside `if()` statements, it is +looking for a logical element, i.e., `TRUE` or `FALSE`. This can cause some +headaches for beginners. For example: + +```{r} +x <- 4 == 3 +if (x) { + "4 equals 3" +} else { + "4 does not equal 3" +} +``` + +As we can see, the "does not equal" message was printed because the vector `x` is `FALSE`: + +```{r} +x <- 4 == 3 +x +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 1 + +Use an `if()` statement to print a suitable message +reporting whether there are any records from 2002 in +the `gapminder` dataset. +Now do the same for 2012. + +::::::::::::::: solution + +## Solution to Challenge 1 + +We will first see a solution to Challenge 1 which does not use the `any()` function.
+We first use a logical vector, marking which elements of `gapminder$year` are equal to `2002`, to select the matching rows: + +```{r ch10pt1-sol, eval=FALSE} +gapminder[(gapminder$year == 2002),] +``` + +Then, we count the number of rows of the data.frame `gapminder` that correspond to the year 2002: + +```{r ch10pt2-sol, eval=FALSE} +rows2002_number <- nrow(gapminder[(gapminder$year == 2002),]) +``` + +The presence of any record for the year 2002 is equivalent to requiring that `rows2002_number` is one or more: + +```{r ch10pt3-sol, eval=FALSE} +rows2002_number >= 1 +``` + +Putting it all together, we obtain: + +```{r ch10pt4-sol, eval=FALSE} +if(nrow(gapminder[(gapminder$year == 2002),]) >= 1){ + print("Record(s) for the year 2002 found.") +} +``` + +All this can be done more quickly with `any()`. The logical condition can be expressed as: + +```{r ch10pt5-sol, eval=FALSE} +if(any(gapminder$year == 2002)){ + print("Record(s) for the year 2002 found.") +} +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +Did anyone get a warning or error message like this? + +```{r, echo=FALSE, error=TRUE} +if (gapminder$year == 2012) {} +``` + +The `if()` function only accepts singular (of length 1) inputs. In R 4.2.0 and +later it returns an error when you use it with a vector; in earlier versions it still +runs, with a warning, but only evaluates the condition in the first element of the vector. +Therefore, to use the `if()` function, you need to make sure your input is +singular (of length 1). + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: Built in `ifelse()` function + +`R` accepts not only `if()` and `else if()` statements structured as outlined above, +but also statements using `R`'s built-in `ifelse()` function.
This +function accepts both singular and vector inputs and is structured as +follows: + +```{r, eval=FALSE} +# ifelse function +ifelse(condition is true, perform action, perform alternative action) + +``` + +where the first argument is the condition or a set of conditions to be met, the +second argument is the value returned when the condition is `TRUE`, +and the third argument is the value returned when the condition +is `FALSE`. + +```{r} +y <- -3 +ifelse(y < 0, "y is a negative number", "y is either positive or zero") + +``` + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: `any()` and `all()` + +The `any()` function will return `TRUE` if at least one +`TRUE` value is found within a vector, otherwise it will return `FALSE`. +This can be used in a similar way to the `%in%` operator. +The function `all()`, as the name suggests, will only return `TRUE` if all values in +the vector are `TRUE`. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Repeating operations + +If you want to iterate over +a set of values, when the order of iteration is important, and perform the +same operation on each, a `for()` loop will do the job. +We saw `for()` loops in the [shell lessons earlier](https://swcarpentry.github.io/shell-novice/05-loop.html). This is the most +flexible of looping operations, but therefore also the hardest to use +correctly. In general, the advice of many `R` users would be to learn about +`for()` loops, but to avoid using `for()` loops unless the order of iteration is +important, i.e. the calculation at each iteration depends on the results of +previous iterations. If the order of iteration is not important, then you +should learn about vectorized alternatives, such as the `purrr` package, as they +pay off in computational efficiency.
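To make the distinction above concrete, here is a minimal sketch (not part of the original lesson) contrasting a calculation where the order of iteration does not matter with one where it does. The vector `values` and the names `squares`, `total`, and `running_total` are hypothetical, chosen only for illustration; the `for()` syntax itself is explained in what follows.

```r
# Hypothetical values, used only for illustration
values <- c(2, 4, 6, 8)

# Order of iteration does not matter: each result depends only on its own
# input, so one vectorized expression does the work of a loop
squares <- values^2

# Order of iteration matters: each step depends on the previous result,
# so a for() loop is a natural fit
running_total <- numeric(length(values))  # results object defined up front
total <- 0
for (i in seq_along(values)) {
  total <- total + values[i]
  running_total[i] <- total
}

squares
running_total
```

For this particular running total, base R also provides `cumsum()`; the loop is written out here only to show a case where each iteration depends on the one before it.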
+ +The basic structure of a `for()` loop is: + +```{r, eval=FALSE} +for (iterator in set of values) { + do a thing +} +``` + +For example: + +```{r} +for (i in 1:10) { + print(i) +} +``` + +The `1:10` bit creates a vector on the fly; you can iterate +over any other vector as well. + +We can use a `for()` loop nested within another `for()` loop to iterate over two things at +once. + +```{r} +for (i in 1:5) { + for (j in c('a', 'b', 'c', 'd', 'e')) { + print(paste(i,j)) + } +} +``` + +We notice in the output that when the first index (`i`) is set to 1, the second +index (`j`) iterates through its full set of indices. Once the indices of `j` +have been iterated through, then `i` is incremented. This process continues +until the last index has been used for each `for()` loop. + +Rather than printing the results, we could write the loop output to a new object. + +```{r} +output_vector <- c() +for (i in 1:5) { + for (j in c('a', 'b', 'c', 'd', 'e')) { + temp_output <- paste(i, j) + output_vector <- c(output_vector, temp_output) + } +} +output_vector +``` + +This approach can be useful, but 'growing your results' (building +the result object incrementally) is computationally inefficient, so avoid +it when you are iterating through a lot of values. + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: don't grow your results + +One of the biggest things that trips up novices and +experienced R users alike is building a results object +(vector, list, matrix, data frame) as your for loop progresses. +Computers are very bad at handling this, so your calculations +can very quickly slow to a crawl. It's much better to define +an empty results object of appropriate dimensions beforehand, rather +than initializing an empty object without dimensions. +So if you know the end result of the loop above will be stored in a matrix, +create an empty matrix with 5 rows and 5 columns, then at each iteration +store the results in the appropriate location.
+ +:::::::::::::::::::::::::::::::::::::::::::::::::: + +A better way is to define your (empty) output object before filling in the values. +For this example, it looks more involved, but is still more efficient. + +```{r} +output_matrix <- matrix(nrow = 5, ncol = 5) +j_vector <- c('a', 'b', 'c', 'd', 'e') +for (i in 1:5) { + for (j in 1:5) { + temp_j_value <- j_vector[j] + temp_output <- paste(i, temp_j_value) + output_matrix[i, j] <- temp_output + } +} +output_vector2 <- as.vector(output_matrix) +output_vector2 +``` + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: While loops + +Sometimes you will find yourself needing to repeat an operation as long as a certain +condition is met. You can do this with a `while()` loop. + +```{r, eval=FALSE} +while(this condition is true){ + do a thing +} +``` + +R will interpret a condition being met as "TRUE". + +As an example, here's a while loop +that generates random numbers from a uniform distribution (the `runif()` function) +between 0 and 1 until it gets one that's less than 0.1. + +```r +z <- 1 +while(z > 0.1){ + z <- runif(1) + cat(z, "\n") +} +``` + +`while()` loops will not always be appropriate. You have to be particularly careful +that you don't end up stuck in an infinite loop because your condition is always met and hence the while statement never terminates. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 2 + +Compare the objects `output_vector` and +`output_vector2`. Are they the same? If not, why not? +How would you change the last block of code to make `output_vector2` +the same as `output_vector`? 
+ +::::::::::::::: solution + +## Solution to Challenge 2 + +We can check whether the two vectors are identical using the `all()` function, which returns `FALSE` here: + +```{r ch10pt6-sol, eval=FALSE} +all(output_vector == output_vector2) +``` + +However, all the elements of `output_vector` can be found in `output_vector2`: + +```{r ch10pt7-sol, eval=FALSE} +all(output_vector %in% output_vector2) +``` + +and vice versa: + +```{r ch10pt8-sol, eval=FALSE} +all(output_vector2 %in% output_vector) +``` + +Therefore, `output_vector` and `output_vector2` contain the same elements, just sorted in a different order. +This is because `as.vector()` reads the elements of a matrix column by column. +Looking at `output_matrix`, we can see that we want its elements row by row. +The solution is to transpose `output_matrix`. We can do this either by calling the transpose function +`t()` or by storing the elements in the right order in the first place. +The first solution requires changing the original + +```{r ch10pt9-sol, eval=FALSE} +output_vector2 <- as.vector(output_matrix) +``` + +into + +```{r ch10pt10-sol, eval=FALSE} +output_vector2 <- as.vector(t(output_matrix)) +``` + +The second solution requires changing + +```{r ch10pt11-sol, eval=FALSE} +output_matrix[i, j] <- temp_output +``` + +into + +```{r ch10pt12-sol, eval=FALSE} +output_matrix[j, i] <- temp_output +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 3 + +Write a script that loops through the `gapminder` data by continent and prints out +whether the mean life expectancy is smaller or larger than 50 +years.
+ +::::::::::::::: solution + +## Solution to Challenge 3 + +**Step 1**: We want to make sure we can extract all the unique values of the continent vector: + +```{r 07-chall-03-sol-a, eval=FALSE} +gapminder <- read.csv("data/gapminder_data.csv") +unique(gapminder$continent) +``` + +**Step 2**: We also need to loop over each of these continents and calculate the average life expectancy for each `subset` of data. +We can do that as follows: + +1. Loop over each of the unique values of 'continent' +2. For each value of continent, create a temporary variable storing that subset +3. Return the calculated life expectancy to the user by printing the output: + +```{r 07-chall-03-sol-b, eval=FALSE} +for (iContinent in unique(gapminder$continent)) { + tmp <- gapminder[gapminder$continent == iContinent, ] + cat(iContinent, mean(tmp$lifeExp, na.rm = TRUE), "\n") + rm(tmp) +} +``` + +**Step 3**: The exercise asks us to report whether the average life expectancy is below or above 50 years, not the average itself. +So we need to add an `if()` condition before printing, which evaluates whether the calculated average life expectancy is above or below a threshold, and prints an output conditional on the result. +We need to amend (3) from above: + +3a.
If the calculated life expectancy is less than some threshold (50 years), return the continent and a statement that life expectancy is less than threshold, otherwise return the continent and a statement that life expectancy is greater than threshold: + +```{r 07-chall-03-sol-c, eval=FALSE} +thresholdValue <- 50 + +for (iContinent in unique(gapminder$continent)) { + tmp <- mean(gapminder[gapminder$continent == iContinent, "lifeExp"]) + + if (tmp < thresholdValue){ + cat("Average Life Expectancy in", iContinent, "is less than", thresholdValue, "\n") + } else { + cat("Average Life Expectancy in", iContinent, "is greater than", thresholdValue, "\n") + } # end if else condition + rm(tmp) +} # end for loop + +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 4 + +Modify the script from Challenge 3 to loop over each +country. This time print out whether the life expectancy is +smaller than 50, between 50 and 70, or greater than 70. 
+ +::::::::::::::: solution + +## Solution to Challenge 4 + +We modify our solution to Challenge 3 by adding two thresholds, `lowerThreshold` and `upperThreshold`, and extending our if-else statements: + +```{r 07-chall-04-sol, eval=FALSE} +lowerThreshold <- 50 +upperThreshold <- 70 + +for (iCountry in unique(gapminder$country)) { + tmp <- mean(gapminder[gapminder$country == iCountry, "lifeExp"]) + + if (tmp < lowerThreshold) { + cat("Average Life Expectancy in", iCountry, "is less than", lowerThreshold, "\n") + } else if (tmp < upperThreshold) { # reached only when tmp >= lowerThreshold + cat("Average Life Expectancy in", iCountry, "is between", lowerThreshold, "and", upperThreshold, "\n") + } else { + cat("Average Life Expectancy in", iCountry, "is greater than", upperThreshold, "\n") + } + rm(tmp) +} +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 5 - Advanced + +Write a script that loops over each country in the `gapminder` dataset, +tests whether the country starts with a 'B', and graphs life expectancy +against time as a line graph if the mean life expectancy is under 50 years. + +::::::::::::::: solution + +## Solution to Challenge 5 + +We will use the `grep()` command that was introduced in the [Unix Shell lesson](https://swcarpentry.github.io/shell-novice/07-find.html) +to find countries that start with "B." +Let's understand how to do this first. +Following from the Unix shell section, we may be tempted to try the following: + +```{r 07-chall-05-sol-a, eval=FALSE} +grep("^B", unique(gapminder$country)) +``` + +But when we evaluate this command it returns the indices of the elements of `unique(gapminder$country)` that start with "B."
+To get the values, we must add the `value=TRUE` option to the `grep()` command: + +```{r 07-chall-05-sol-b, eval=FALSE} +grep("^B", unique(gapminder$country), value = TRUE) +``` + +We will now store these countries in a variable called candidateCountries, and then loop over each entry in the variable. +Inside the loop, we evaluate the average life expectancy for each country, and if the average life expectancy is less than 50 we use base-plot to plot the evolution of average life expectancy using `with()` and `subset()`: + +```{r 07-chall-05-sol-c, eval=FALSE} +thresholdValue <- 50 +candidateCountries <- grep("^B", unique(gapminder$country), value = TRUE) + +for (iCountry in candidateCountries) { + tmp <- mean(gapminder[gapminder$country == iCountry, "lifeExp"]) + + if (tmp < thresholdValue) { + cat("Average Life Expectancy in", iCountry, "is less than", thresholdValue, "plotting life expectancy graph... \n") + + with(subset(gapminder, country == iCountry), + plot(year, lifeExp, + type = "o", + main = paste("Life Expectancy in", iCountry, "over time"), + ylab = "Life Expectancy", + xlab = "Year" + ) # end plot + ) # end with + } # end if + rm(tmp) +} # end for loop +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: keypoints + +- Use `if` and `else` to make choices. +- Use `for` to repeat operations. + +:::::::::::::::::::::::::::::::::::::::::::::::::: diff --git a/locale/es/episodes/08-plot-ggplot2.Rmd b/locale/es/episodes/08-plot-ggplot2.Rmd new file mode 100644 index 000000000..12998dc14 --- /dev/null +++ b/locale/es/episodes/08-plot-ggplot2.Rmd @@ -0,0 +1,471 @@ +--- +title: Creating Publication-Quality Graphics with ggplot2 +teaching: 60 +exercises: 20 +source: Rmd +--- + +::::::::::::::::::::::::::::::::::::::: objectives + +- To be able to use ggplot2 to generate publication-quality graphics. +- To apply geometry, aesthetic, and statistics layers to a ggplot plot. 
+- To manipulate the aesthetics of a plot using different colors, shapes, and lines. +- To improve data visualization through transforming scales and paneling by group. +- To save a plot created with ggplot to disk. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- How can I create publication-quality graphics in R? + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +```{r, include=FALSE} +gapminder <- read.csv("data/gapminder_data.csv", header = TRUE) +``` + +Plotting our data is one of the best ways to +quickly explore it and the various relationships +between variables. + +There are three main plotting systems in R, +the [base plotting system][base], the [lattice] +package, and the [ggplot2] package. + +Today we'll be learning about the ggplot2 package, because +it is the most effective for creating publication-quality +graphics. + +ggplot2 is built on the grammar of graphics, the idea that any plot can be +built from the same set of components: a **data set**, +**mapping aesthetics**, and graphical **layers**: + +- **Data sets** are the data that you, the user, provide. + +- **Mapping aesthetics** are what connect the data to the graphics. + They tell ggplot2 how to use your data to affect how the graph looks, + such as changing what is plotted on the X or Y axis, or the size or + color of different data points. + +- **Layers** are the actual graphical output from ggplot2. Layers + determine what kinds of plot are shown (scatterplot, histogram, etc.), + the coordinate system used (rectangular, polar, others), and other + important aspects of the plot. The idea of layers of graphics may + be familiar to you if you have used image editing programs + like Photoshop, Illustrator, or Inkscape. + +Let's start off building an example using the gapminder data from earlier. +The most basic function is `ggplot`, which lets R know that we're +creating a new plot. 
Any of the arguments we give the `ggplot` +function are the _global_ options for the plot: they apply to all +layers on the plot. + +```{r blank-ggplot, message=FALSE, fig.alt="Blank plot, before adding any mapping aesthetics to ggplot()."} +library("ggplot2") +ggplot(data = gapminder) +``` + +Here we called `ggplot` and told it what data we want to show on +our figure. This is not enough information for `ggplot` to actually +draw anything. It only creates a blank slate for other elements +to be added to. + +Now we're going to add in the **mapping aesthetics** using the +`aes` function. `aes` tells `ggplot` how variables in the **data** +map to _aesthetic_ properties of the figure, such as which columns +of the data should be used for the **x** and **y** locations. + +```{r ggplot-with-aes, message=FALSE, fig.alt="Plotting area with axes for a scatter plot of life expectancy vs GDP, with no data points visible."} +ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +``` + +Here we told `ggplot` we want to plot the "gdpPercap" column of the +gapminder data frame on the x-axis, and the "lifeExp" column on the +y-axis. Notice that we didn't need to explicitly pass `aes` these +columns (e.g. `x = gapminder[, "gdpPercap"]`), this is because +`ggplot` is smart enough to know to look in the **data** for that column! + +The final part of making our plot is to tell `ggplot` how we want to +visually represent the data. We do this by adding a new **layer** +to the plot using one of the **geom** functions. + +```{r lifeExp-vs-gdpPercap-scatter, message=FALSE, fig.alt="Scatter plot of life expectancy vs GDP per capita, now showing the data points."} +ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) + + geom_point() +``` + +Here we used `geom_point`, which tells `ggplot` we want to visually +represent the relationship between **x** and **y** as a scatterplot of points. 
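To make the idea of adding a layer concrete, here is a small sketch, assuming ggplot2 is installed. The data frame `toy` and the object names `base_plot` and `with_points` are made up for illustration and are not part of the lesson; the point is only that `ggplot()` by itself holds no layers, while `+ geom_point()` returns a new plot object that has one.

```r
library(ggplot2)

# Hypothetical toy data, used only so the sketch runs without the gapminder file.
toy <- data.frame(gdp = c(500, 2000, 8000, 30000),
                  life = c(45, 58, 70, 79))

# ggplot() alone stores the data and the mappings but has no layers yet.
base_plot <- ggplot(data = toy, mapping = aes(x = gdp, y = life))

# Adding a geom returns a new plot object with one layer; base_plot is unchanged.
with_points <- base_plot + geom_point()

length(base_plot$layers)
length(with_points$layers)
```

Storing the plot in a variable like this is optional; it is mainly convenient when you want to try several different layers on top of the same data and mappings.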
+ +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 1 + +Modify the example so that the figure shows how life expectancy has +changed over time: + +```{r, eval=FALSE} +ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) + geom_point() +``` + +Hint: the gapminder dataset has a column called "year", which should appear +on the x-axis. + +::::::::::::::: solution + +## Solution to challenge 1 + +Here is one possible solution: + +```{r ch1-sol, fig.cap="Binned scatterplot of life expectancy versus year showing how life expectancy has increased over time"} +ggplot(data = gapminder, mapping = aes(x = year, y = lifeExp)) + geom_point() +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 2 + +In the previous examples and challenge we've used the `aes` function to tell +the scatterplot **geom** about the **x** and **y** locations of each point. +Another _aesthetic_ property we can modify is the point _color_. Modify the +code from the previous challenge to **color** the points by the "continent" +column. What trends do you see in the data? Are they what you expected? + +::::::::::::::: solution + +## Solution to challenge 2 + +The solution presented below adds `color=continent` to the call of the `aes` +function. The general trend seems to indicate an increased life expectancy +over the years. On continents with stronger economies we find a longer life +expectancy. + +```{r ch2-sol, fig.cap="Binned scatterplot of life expectancy vs year with color-coded continents showing value of 'aes' function"} +ggplot(data = gapminder, mapping = aes(x = year, y = lifeExp, color=continent)) + + geom_point() +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Layers + +Using a scatterplot probably isn't the best for visualizing change over time. 
+Instead, let's tell `ggplot` to visualize the data as a line plot: + +```{r lifeExp-line} +ggplot(data = gapminder, mapping = aes(x=year, y=lifeExp, color=continent)) + + geom_line() +``` + +Instead of adding a `geom_point` layer, we've added a `geom_line` layer. + +However, the result doesn't look quite as we might have expected: it seems to be jumping around a lot in each continent. Let's try to separate the data by country, plotting one line for each country: + +```{r lifeExp-line-by} +ggplot(data = gapminder, mapping = aes(x=year, y=lifeExp, group=country, color=continent)) + + geom_line() +``` + +We've added the **group** _aesthetic_, which tells `ggplot` to draw a line for each +country. + +But what if we want to visualize both lines and points on the plot? We can +add another layer to the plot: + +```{r lifeExp-line-point} +ggplot(data = gapminder, mapping = aes(x=year, y=lifeExp, group=country, color=continent)) + + geom_line() + geom_point() +``` + +It's important to note that each layer is drawn on top of the previous layer. In +this example, the points have been drawn _on top of_ the lines. Here's a +demonstration: + +```{r lifeExp-layer-example-1} +ggplot(data = gapminder, mapping = aes(x=year, y=lifeExp, group=country)) + + geom_line(mapping = aes(color=continent)) + geom_point() +``` + +In this example, the _aesthetic_ mapping of **color** has been moved from the +global plot options in `ggplot` to the `geom_line` layer so it no longer applies +to the points. Now we can clearly see that the points are drawn on top of the +lines. + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: Setting an aesthetic to a value instead of a mapping + +So far, we've seen how to use an aesthetic (such as **color**) as a _mapping_ to a variable in the data. For example, when we use `geom_line(mapping = aes(color=continent))`, ggplot will give a different color to each continent. But what if we want to change the color of all lines to blue? 
You may think that `geom_line(mapping = aes(color="blue"))` should work, but it doesn't: inside `aes()`, ggplot2 treats `"blue"` as a data value to map rather than as a color name. Since we don't want to create a mapping to a specific variable, we can move the color specification outside of the `aes()` function, like this: `geom_line(color="blue")`. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 3 + +Switch the order of the point and line layers from the previous example. What +happened? + +::::::::::::::: solution + +## Solution to challenge 3 + +The lines now get drawn over the points! + +```{r ch3-sol, fig.alt="Plot of life expectancy vs year with black points for each observation and a line for each country colored by continent; the lines are drawn on top of the points."} +ggplot(data = gapminder, mapping = aes(x=year, y=lifeExp, group=country)) + + geom_point() + geom_line(mapping = aes(color=continent)) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Transformations and statistics + +ggplot2 also makes it easy to overlay statistical models over the data. To +demonstrate we'll go back to our first example: + +```{r lifeExp-vs-gdpPercap-scatter3, message=FALSE} +ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) + + geom_point() +``` + +Currently it's hard to see the relationship between the points due to some strong +outliers in GDP per capita. We can change the scale of units on the x axis using +the _scale_ functions. These control the mapping between the data values and +visual values of an aesthetic. We can also modify the transparency of the +points, using the _alpha_ aesthetic, which is especially helpful when you have +a large amount of data which is very clustered.
+ +```{r axis-scale, fig.cap="Scatterplot of GDP vs life expectancy showing logarithmic x-axis data spread"} +ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) + + geom_point(alpha = 0.5) + scale_x_log10() +``` + +The `scale_x_log10` function applied a transformation to the coordinate system of the plot, so that each multiple of 10 is evenly spaced from left to right. For example, a GDP per capita of 1,000 is the same horizontal distance away from a value of 10,000 as the 10,000 value is from 100,000. This helps to visualize the spread of the data along the x-axis. + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip Reminder: Setting an aesthetic to a value instead of a mapping + +Notice that we used `geom_point(alpha = 0.5)`. As the previous tip mentioned, using a setting outside of the `aes()` function will cause this value to be used for all points, which is what we want in this case. But just like any other aesthetic setting, _alpha_ can also be mapped to a variable in the data. For example, we can give a different transparency to each continent with `geom_point(mapping = aes(alpha = continent))`. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +We can fit a simple relationship to the data by adding another layer, +`geom_smooth`: + +```{r lm-fit, fig.alt="Scatter plot of life expectancy vs GDP per capita with a blue trend line summarising the relationship between variables, and gray shaded area indicating 95% confidence intervals for that trend line."} +ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) + + geom_point(alpha = 0.5) + scale_x_log10() + geom_smooth(method="lm") +``` + +We can make the line thicker by _setting_ the **linewidth** aesthetic in the +`geom_smooth` layer: + +```{r lm-fit2, fig.alt="Scatter plot of life expectancy vs GDP per capita with a trend line summarising the relationship between variables. 
The blue trend line is slightly thicker than in the previous figure."} +ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) + + geom_point(alpha = 0.5) + scale_x_log10() + geom_smooth(method="lm", linewidth=1.5) +``` + +There are two ways an _aesthetic_ can be specified. Here we _set_ the **linewidth** aesthetic by passing it as an argument to `geom_smooth`, so it is applied uniformly to the whole `geom`. Previously in the lesson we've used the `aes` function to define a _mapping_ between data variables and their visual representation. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 4a + +Modify the color and size of the points on the point layer in the previous +example. + +Hint: do not use the `aes` function. + +Hint: the equivalent of `linewidth` for points is `size`. + +::::::::::::::: solution + +## Solution to challenge 4a + +Here is a possible solution: +Notice that the `color` argument is supplied outside of the `aes()` function. +This means that it applies to all data points on the graph and is not related to +a specific variable. + +```{r ch4a-sol, fig.alt="Scatter plot of life expectancy vs GDP per capita with a trend line summarising the relationship between variables. The plot illustrates the possibilities for styling visualisations in ggplot2 with data points enlarged, coloured orange, and displayed without transparency."} +ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) + + geom_point(size=3, color="orange") + scale_x_log10() + + geom_smooth(method="lm", linewidth=1.5) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 4b + +Modify your solution to Challenge 4a so that the +points are now a different shape and are colored by continent with new +trendlines. Hint: The color argument can be used inside the aesthetic.
+ +::::::::::::::: solution + +## Solution to challenge 4b + +Here is a possible solution: +Notice that supplying the `color` argument inside the `aes()` functions enables you to +connect it to a certain variable. The `shape` argument, as you can see, modifies all +data points the same way (it is outside the `aes()` call) while the `color` argument which +is placed inside the `aes()` call modifies a point's color based on its continent value. + +```{r ch4b-sol} +ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp, color = continent)) + + geom_point(size=3, shape=17) + scale_x_log10() + + geom_smooth(method="lm", linewidth=1.5) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Multi-panel figures + +Earlier we visualized the change in life expectancy over time across all +countries in one plot. Alternatively, we can split this out over multiple panels +by adding a layer of **facet** panels. + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip + +We start by making a subset of data including only countries located +in the Americas. This includes 25 countries, which will begin to +clutter the figure. Note that we apply a "theme" definition to rotate +the x-axis labels to maintain readability. Nearly everything in +ggplot2 is customizable. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +```{r facet} +americas <- gapminder[gapminder$continent == "Americas",] +ggplot(data = americas, mapping = aes(x = year, y = lifeExp)) + + geom_line() + + facet_wrap( ~ country) + + theme(axis.text.x = element_text(angle = 45)) +``` + +The `facet_wrap` layer took a "formula" as its argument, denoted by the tilde +(~). This tells R to draw a panel for each unique value in the country column +of the gapminder dataset. + +## Modifying text + +To clean this figure up for a publication we need to change some of the text +elements. 
The x-axis is too cluttered, and the y axis should read +"Life expectancy", rather than the column name in the data frame. + +We can do this by adding a couple of different layers. The **theme** layer +controls the axis text, and overall text size. Labels for the axes, plot +title and any legend can be set using the `labs` function. Legend titles +are set using the same names we used in the `aes` specification. Thus below +the color legend title is set using `color = "Continent"`, while the title +of a fill legend would be set using `fill = "MyTitle"`. + +```{r theme} +ggplot(data = americas, mapping = aes(x = year, y = lifeExp, color=continent)) + + geom_line() + facet_wrap( ~ country) + + labs( + x = "Year", # x axis title + y = "Life expectancy", # y axis title + title = "Figure 1", # main title of figure + color = "Continent" # title of legend + ) + + theme(axis.text.x = element_text(angle = 90, hjust = 1)) +``` + +## Exporting the plot + +The `ggsave()` function allows you to export a plot created with ggplot. You can specify the dimension and resolution of your plot by adjusting the appropriate arguments (`width`, `height` and `dpi`) to create high quality graphics for publication. In order to save the plot from above, we first assign it to a variable `lifeExp_plot`, then tell `ggsave` to save that plot in `png` format to a directory called `results`. (Make sure you have a `results/` folder in your working directory.) 
+ +```{r directory-check, echo=FALSE} +if (!dir.exists("results")) { + dir.create("results") +} +``` + +```{r save} +lifeExp_plot <- ggplot(data = americas, mapping = aes(x = year, y = lifeExp, color=continent)) + + geom_line() + facet_wrap( ~ country) + + labs( + x = "Year", # x axis title + y = "Life expectancy", # y axis title + title = "Figure 1", # main title of figure + color = "Continent" # title of legend + ) + + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + +ggsave(filename = "results/lifeExp.png", plot = lifeExp_plot, width = 12, height = 10, dpi = 300, units = "cm") +``` + +There are two nice things about `ggsave`. First, it defaults to the last plot, so if you omit the `plot` argument it will automatically save the last plot you created with `ggplot`. Secondly, it tries to determine the format you want to save your plot in from the file extension you provide for the filename (for example `.png` or `.pdf`). If you need to, you can specify the format explicitly in the `device` argument. + +This is a taste of what you can do with ggplot2. RStudio provides a +really useful [cheat sheet][cheat] of the different layers available, and more +extensive documentation is available on the [ggplot2 website][ggplot-doc]. All RStudio cheat sheets are available from the [RStudio website][cheat_all]. +Finally, if you have no idea how to change something, a quick Google search will +usually send you to a relevant question and answer on Stack Overflow with reusable +code to modify! + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 5 + +Generate boxplots to compare life expectancy between the different continents during the available years. + +Advanced: + +- Rename y axis as Life Expectancy. +- Remove x axis labels. 
+ +::::::::::::::: solution + +## Solution to Challenge 5 + +Here is a possible solution: +`xlab()` and `ylab()` set labels for the x and y axes, respectively. +The axis title, text and ticks are attributes of the theme and must be modified within a `theme()` call. + +```{r ch5-sol} +ggplot(data = gapminder, mapping = aes(x = continent, y = lifeExp, fill = continent)) + + geom_boxplot() + facet_wrap(~year) + + ylab("Life Expectancy") + + theme(axis.title.x=element_blank(), + axis.text.x = element_blank(), + axis.ticks.x = element_blank()) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +[base]: https://www.statmethods.net/graphs/index.html +[lattice]: https://www.statmethods.net/advgraphs/trellis.html +[ggplot2]: https://www.statmethods.net/advgraphs/ggplot2.html +[cheat]: https://www.rstudio.org/links/data_visualization_cheat_sheet +[cheat_all]: https://www.rstudio.com/resources/cheatsheets/ +[ggplot-doc]: https://ggplot2.tidyverse.org/reference/ + +:::::::::::::::::::::::::::::::::::::::: keypoints + +- Use `ggplot2` to create plots. +- Think about graphics in layers: aesthetics, geometry, statistics, scale transformation, and grouping. + +:::::::::::::::::::::::::::::::::::::::::::::::::: diff --git a/locale/es/episodes/09-vectorization.Rmd b/locale/es/episodes/09-vectorization.Rmd new file mode 100644 index 000000000..9cae732ed --- /dev/null +++ b/locale/es/episodes/09-vectorization.Rmd @@ -0,0 +1,332 @@ +--- +title: Vectorization +teaching: 10 +exercises: 15 +source: Rmd +--- + +::::::::::::::::::::::::::::::::::::::: objectives + +- To understand vectorized operations in R. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- How can I operate on all the elements of a vector at once?
+ +:::::::::::::::::::::::::::::::::::::::::::::::::: + +```{r, include=FALSE} +gapminder <- read.csv("data/gapminder_data.csv", header = TRUE) +library("ggplot2") +``` + +Most of R's functions are vectorized, meaning that the function will +operate on all elements of a vector without needing to loop through +and act on each element one at a time. This makes code more +concise, easier to read, and less error-prone. + +```{r} +x <- 1:4 +x * 2 +``` + +The multiplication happened to each element of the vector. + +We can also add two vectors together: + +```{r} +y <- 6:9 +x + y +``` + +Each element of `x` was added to its corresponding element of `y`: + +```{r, eval=FALSE} +x: 1 2 3 4 + + + + + +y: 6 7 8 9 +--------------- + 7 9 11 13 +``` + +Here is how we would add two vectors together using a for loop: + +```{r} +output_vector <- c() +for (i in 1:4) { + output_vector[i] <- x[i] + y[i] +} +output_vector +``` + +Compare this to the output using vectorized operations. + +```{r} +sum_xy <- x + y +sum_xy +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 1 + +Let's try this on the `pop` column of the `gapminder` dataset. + +Make a new column in the `gapminder` data frame that +contains population in units of millions of people. +Check the head or tail of the data frame to make sure +it worked. + +::::::::::::::: solution + +## Solution to challenge 1 + +Dividing the `pop` column by one million operates on every row at once, +and `head()` lets us check the new column: + +```{r} +gapminder$pop_millions <- gapminder$pop / 1e6 +head(gapminder) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 2 + +On a single graph, plot population, in +millions, against year, for all countries.
Do not worry about +identifying which country is which. + +Repeat the exercise, graphing only for China, India, and +Indonesia. Again, do not worry about which is which. + +::::::::::::::: solution + +## Solution to challenge 2 + +Refresh your plotting skills by plotting population in millions against year. + +```{r ch2-sol, fig.alt="Scatter plot showing populations in the millions against the year for China, India, and Indonesia, countries are not labeled."} +ggplot(gapminder, aes(x = year, y = pop_millions)) + + geom_point() +countryset <- c("China","India","Indonesia") +ggplot(gapminder[gapminder$country %in% countryset,], + aes(x = year, y = pop_millions)) + + geom_point() +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +Comparison operators, logical operators, and many functions are also +vectorized: + +**Comparison operators** + +```{r} +x > 2 +``` + +**Logical operators** + +```{r} +a <- x > 3 # or, for clarity, a <- (x > 3) +a +``` + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: some useful functions for logical vectors + +`any()` will return `TRUE` if _any_ element of a vector is `TRUE`.\ +`all()` will return `TRUE` if _all_ elements of a vector are `TRUE`. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +Most functions also operate element-wise on vectors: + +**Functions** + +```{r} +x <- 1:4 +log(x) +``` + +Vectorized operations work element-wise on matrices: + +```{r} +m <- matrix(1:12, nrow=3, ncol=4) +m * -1 +``` + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: element-wise vs. matrix multiplication + +Very important: the operator `*` gives you element-wise multiplication! 
+To do matrix multiplication, we need to use the `%*%` operator: + +```{r} +m %*% matrix(1, nrow=4, ncol=1) +matrix(1:4, nrow=1) %*% matrix(1:4, ncol=1) +``` + +For more on matrix algebra, see the Quick-R reference +guide. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 3 + +Given the following matrix: + +```{r} +m <- matrix(1:12, nrow=3, ncol=4) +m +``` + +Write down what you think will happen when you run: + +1. `m ^ -1` +2. `m * c(1, 0, -1)` +3. `m > c(0, 20)` +4. `m * c(1, 0, -1, 2)` + +Did you get the output you expected? If not, ask a helper! + +::::::::::::::: solution + +## Solution to challenge 3 + +Given the following matrix: + +```{r} +m <- matrix(1:12, nrow=3, ncol=4) +m +``` + +Here is what happens when you run each expression: + +1. `m ^ -1` + +```{r, echo=FALSE} +m ^ -1 +``` + +2. `m * c(1, 0, -1)` + +```{r, echo=FALSE} +m * c(1, 0, -1) +``` + +3. `m > c(0, 20)` + +```{r, echo=FALSE} +m > c(0, 20) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 4 + +We're interested in looking at the sum of the +following sequence of fractions: + +```{r, eval=FALSE} + x = 1/(1^2) + 1/(2^2) + 1/(3^2) + ... + 1/(n^2) +``` + +This would be tedious to type out, and impossible for high values of +n. Use vectorization to compute x when n=100. What is the sum when +n=10,000? + +::::::::::::::: solution + +## Solution to challenge 4 + +Because `1:n` is a vector, we can square it, take the reciprocal of every +element, and sum the result in a single expression:
+ +```{r} +sum(1/(1:100)^2) +sum(1/(1:1e04)^2) +n <- 10000 +sum(1/(1:n)^2) +``` + +We can also obtain the same results using a function: + +```{r} +inverse_sum_of_squares <- function(n) { + sum(1/(1:n)^2) +} +inverse_sum_of_squares(100) +inverse_sum_of_squares(10000) +n <- 10000 +inverse_sum_of_squares(n) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: Operations on vectors of unequal length + +Operations can also be performed on vectors of unequal length, through +a process known as _recycling_. This process automatically repeats the smaller vector +until it matches the length of the larger vector. R will provide a warning +if the larger vector is not a multiple of the smaller vector. + +```{r} +x <- c(1, 2, 3) +y <- c(1, 2, 3, 4, 5, 6, 7) +x + y +``` + +Vector `x` was recycled to match the length of vector `y` + +```{r, eval=FALSE} +x: 1 2 3 1 2 3 1 + + + + + + + + +y: 1 2 3 4 5 6 7 +----------------------- + 2 4 6 5 7 9 8 +``` + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: keypoints + +- Use vectorized operations instead of loops. + +:::::::::::::::::::::::::::::::::::::::::::::::::: diff --git a/locale/es/episodes/10-functions.Rmd b/locale/es/episodes/10-functions.Rmd new file mode 100644 index 000000000..ba405661f --- /dev/null +++ b/locale/es/episodes/10-functions.Rmd @@ -0,0 +1,590 @@ +--- +title: Functions Explained +teaching: 45 +exercises: 15 +source: Rmd +--- + +::::::::::::::::::::::::::::::::::::::: objectives + +- Define a function that takes arguments. +- Return a value from a function. +- Check argument conditions with `stopifnot()` in functions. +- Test a function. +- Set default values for function arguments. +- Explain why we should divide programs into small, single-purpose functions. 
+ +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- How can I write a new function in R? + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +```{r, include=FALSE} +gapminder <- read.csv("data/gapminder_data.csv", header = TRUE) +``` + +If we only had one data set to analyze, it would probably be faster to load the +file into a spreadsheet and use that to plot simple statistics. However, the +gapminder data is updated periodically, and we may want to pull in that new +information later and re-run our analysis again. We may also obtain similar data +from a different source in the future. + +In this lesson, we'll learn how to write a function so that we can repeat +several operations with a single command. + +::::::::::::::::::::::::::::::::::::::::: callout + +## What is a function? + +Functions gather a sequence of operations into a whole, preserving it for +ongoing use. Functions provide: + +- a name we can remember and invoke it by +- relief from the need to remember the individual operations +- a defined set of inputs and expected outputs +- rich connections to the larger programming environment + +As the basic building block of most programming languages, user-defined +functions constitute "programming" as much as any single abstraction can. If +you have written a function, you are a computer programmer. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Defining a function + +Let's open a new R script file in the `functions/` directory and call it +functions-lesson.R. 
+ +The general structure of a function is: + +```{r} +my_function <- function(parameters) { + # perform action + # return value +} +``` + +Let's define a function `fahr_to_kelvin()` that converts temperatures from +Fahrenheit to Kelvin: + +```{r} +fahr_to_kelvin <- function(temp) { + kelvin <- ((temp - 32) * (5 / 9)) + 273.15 + return(kelvin) +} +``` + +We define `fahr_to_kelvin()` by assigning it to the output of `function`. The +list of argument names is contained within parentheses. Next, the +[body](../learners/reference.md#body) of the function--the +statements that are executed when it runs--is contained within curly braces +(`{}`). The statements in the body are indented by two spaces. This makes the +code easier to read but does not affect how the code operates. + +It is useful to think of creating functions like writing a cookbook. First you define the "ingredients" that your function needs. In this case, we only need one ingredient to use our function: "temp". After we list our ingredients, we then say what we will do with them. In this case, we take our ingredient and apply a set of mathematical operators to it. + +When we call the function, the values we pass to it as arguments are assigned to +those variables so that we can use them inside the function. Inside the +function, we use a return +statement to send a result back to +whoever asked for it. + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip + +In R, the return statement is not required. +R automatically returns the value of the last expression evaluated in the body +of the function. But for clarity, we will explicitly define the +return statement. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +Let's try running our function. 
+Calling our own function is no different from calling any other function: + +```{r} +# freezing point of water +fahr_to_kelvin(32) +``` + +```{r} +# boiling point of water +fahr_to_kelvin(212) +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 1 + +Write a function called `kelvin_to_celsius()` that takes a temperature in +Kelvin and returns that temperature in Celsius. + +Hint: To convert from Kelvin to Celsius you subtract 273.15 + +::::::::::::::: solution + +## Solution to challenge 1 + +Write a function called `kelvin_to_celsius` that takes a temperature in Kelvin +and returns that temperature in Celsius + +```{r} +kelvin_to_celsius <- function(temp) { + celsius <- temp - 273.15 + return(celsius) +} +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Combining functions + +The real power of functions comes from mixing, matching and combining them +into ever-larger chunks to get the effect we want. + +Let's define two functions that will convert temperature from Fahrenheit to +Kelvin, and Kelvin to Celsius: + +```{r} +fahr_to_kelvin <- function(temp) { + kelvin <- ((temp - 32) * (5 / 9)) + 273.15 + return(kelvin) +} + +kelvin_to_celsius <- function(temp) { + celsius <- temp - 273.15 + return(celsius) +} +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 2 + +Define the function to convert directly from Fahrenheit to Celsius, +by reusing the two functions above (or using your own functions if you +prefer). 
+ +::::::::::::::: solution + +## Solution to challenge 2 + +Define the function to convert directly from Fahrenheit to Celsius, +by reusing these two functions above + +```{r} +fahr_to_celsius <- function(temp) { + temp_k <- fahr_to_kelvin(temp) + result <- kelvin_to_celsius(temp_k) + return(result) +} +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Interlude: Defensive Programming + +Now that we've begun to appreciate how writing functions provides an efficient +way to make R code re-usable and modular, we should note that it is important +to ensure that functions only work in their intended use-cases. Checking +function parameters is related to the concept of _defensive programming_. +Defensive programming encourages us to frequently check conditions and throw an +error if something is wrong. These checks are referred to as assertion +statements because we want to assert some condition is `TRUE` before proceeding. +They make it easier to debug because they give us a better idea of where the +errors originate. + +### Checking conditions with `stopifnot()` + +Let's start by re-examining `fahr_to_kelvin()`, our function for converting +temperatures from Fahrenheit to Kelvin. It was defined like so: + +```{r} +fahr_to_kelvin <- function(temp) { + kelvin <- ((temp - 32) * (5 / 9)) + 273.15 + return(kelvin) +} +``` + +For this function to work as intended, the argument `temp` must be a `numeric` +value; otherwise, the mathematical procedure for converting between the two +temperature scales will not work. To create an error, we can use the function +`stop()`. For example, since the argument `temp` must be a `numeric` vector, we +could check for this condition with an `if` statement and throw an error if the +condition was violated. 
We could augment our function above like so: + +```{r} +fahr_to_kelvin <- function(temp) { + if (!is.numeric(temp)) { + stop("temp must be a numeric vector.") + } + kelvin <- ((temp - 32) * (5 / 9)) + 273.15 + return(kelvin) +} +``` + +If we had multiple conditions or arguments to check, it would take many lines +of code to check all of them. Luckily R provides the convenience function +`stopifnot()`. We can list as many requirements that should evaluate to `TRUE`; +`stopifnot()` throws an error if it finds one that is `FALSE`. Listing these +conditions also serves a secondary purpose as extra documentation for the +function. + +Let's try out defensive programming with `stopifnot()` by adding assertions to +check the input to our function `fahr_to_kelvin()`. + +We want to assert the following: `temp` is a numeric vector. We may do that like +so: + +```{r} +fahr_to_kelvin <- function(temp) { + stopifnot(is.numeric(temp)) + kelvin <- ((temp - 32) * (5 / 9)) + 273.15 + return(kelvin) +} +``` + +It still works when given proper input. + +```{r} +# freezing point of water +fahr_to_kelvin(temp = 32) +``` + +But fails instantly if given improper input. + +```{r} +# Metric is a factor instead of numeric +fahr_to_kelvin(temp = as.factor(32)) +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 3 + +Use defensive programming to ensure that our `fahr_to_celsius()` function +throws an error immediately if the argument `temp` is specified +inappropriately. + +::::::::::::::: solution + +## Solution to challenge 3 + +Extend our previous definition of the function by adding in an explicit call +to `stopifnot()`. Since `fahr_to_celsius()` is a composition of two other +functions, checking inside here makes adding checks to the two component +functions redundant. 
+ +```{r} +fahr_to_celsius <- function(temp) { + stopifnot(is.numeric(temp)) + temp_k <- fahr_to_kelvin(temp) + result <- kelvin_to_celsius(temp_k) + return(result) +} +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## More on combining functions + +Now, we're going to define a function that calculates the Gross Domestic Product +of a nation from the data available in our dataset: + +```{r} +# Takes a dataset and multiplies the population column +# with the GDP per capita column. +calcGDP <- function(dat) { + gdp <- dat$pop * dat$gdpPercap + return(gdp) +} +``` + +We define `calcGDP()` by assigning it to the output of `function`. The list of +argument names is contained within parentheses. Next, the body of the function +\-- the statements executed when you call the function -- is contained within +curly braces (`{}`). + +We've indented the statements in the body by two spaces. This makes the code +easier to read but does not affect how it operates. + +When we call the function, the values we pass to it are assigned to the +arguments, which become variables inside the body of the function. + +Inside the function, we use the `return()` function to send back the result. +This `return()` function is optional: R will automatically return the results of +whatever command is executed on the last line of the function. + +```{r} +calcGDP(head(gapminder)) +``` + +That's not very informative. Let's add some more arguments so we can extract +that per year and country. + +```{r} +# Takes a dataset and multiplies the population column +# with the GDP per capita column. 
+calcGDP <- function(dat, year=NULL, country=NULL) { + if(!is.null(year)) { + dat <- dat[dat$year %in% year, ] + } + if (!is.null(country)) { + dat <- dat[dat$country %in% country,] + } + gdp <- dat$pop * dat$gdpPercap + + new <- cbind(dat, gdp=gdp) + return(new) +} +``` + +If you've been writing these functions down into a separate R script +(a good idea!), you can load in the functions into our R session by using the +`source()` function: + +```{r, eval=FALSE} +source("functions/functions-lesson.R") +``` + +Ok, so there's a lot going on in this function now. In plain English, the +function now subsets the provided data by year if the year argument isn't empty, +then subsets the result by country if the country argument isn't empty. Then it +calculates the GDP for whatever subset emerges from the previous two steps. The +function then adds the GDP as a new column to the subsetted data and returns +this as the final result. You can see that the output is much more informative +than a vector of numbers. + +Let's take a look at what happens when we specify the year: + +```{r} +head(calcGDP(gapminder, year=2007)) +``` + +Or for a specific country: + +```{r} +calcGDP(gapminder, country="Australia") +``` + +Or both: + +```{r} +calcGDP(gapminder, year=2007, country="Australia") +``` + +Let's walk through the body of the function: + +```{r, eval=FALSE} +calcGDP <- function(dat, year=NULL, country=NULL) { +``` + +Here we've added two arguments, `year`, and `country`. We've set +_default arguments_ for both as `NULL` using the `=` operator +in the function definition. This means that those arguments will +take on those values unless the user specifies otherwise. 
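The effect of a default argument can be seen in isolation with a minimal sketch. The function `describe_temp()` and its `digits` argument are hypothetical, invented here for illustration only, and are not part of the lesson's code:

```{r}
# Hypothetical helper, not part of the lesson: `digits` defaults to NULL,
# which we treat here as "do not round".
describe_temp <- function(temp, digits = NULL) {
  if (!is.null(digits)) {
    temp <- round(temp, digits)
  }
  return(temp)
}

describe_temp(98.567)              # default used: value comes back unrounded
describe_temp(98.567, digits = 1)  # default overridden: value is rounded
```

This mirrors the pattern used for `year` and `country` in `calcGDP()`: an argument left at `NULL` switches the corresponding step off.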
+ +```{r, eval=FALSE} + if(!is.null(year)) { + dat <- dat[dat$year %in% year, ] + } + if (!is.null(country)) { + dat <- dat[dat$country %in% country,] + } +``` + +Here, we check whether each additional argument is set to `NULL`, and whenever +they're not `NULL` overwrite the dataset stored in `dat` with a subset given by +the non-`NULL` argument. + +Building these conditionals into the function makes it more flexible for later. +Now, we can use it to calculate the GDP for: + +- The whole dataset; +- A single year; +- A single country; +- A single combination of year and country. + +By using `%in%` instead of `==`, we can also give multiple years or countries to those +arguments. + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: Pass by value + +Functions in R almost always make copies of the data to operate on +inside of a function body. When we modify `dat` inside the function +we are modifying the copy of the gapminder dataset stored in `dat`, +not the original variable we gave as the first argument. + +This is called "pass-by-value" and it makes writing code much safer: +you can always be sure that whatever changes you make within the +body of the function stay inside the body of the function. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: Function scope + +Another important concept is scoping: any variables (or functions!) you +create or modify inside the body of a function only exist for the lifetime +of the function's execution. When we call `calcGDP()`, the variables `dat`, +`gdp` and `new` only exist inside the body of the function. Even if we +have variables of the same name in our interactive R session, they are +not modified in any way when executing a function. 
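
A small sketch may make this concrete. The names `total` and `add_one()` are toy examples chosen for illustration and do not appear elsewhere in the lesson:

```{r}
# Toy example: names are illustrative only.
total <- 10

add_one <- function(x) {
  total <- x + 1   # creates a local `total`; the one outside is untouched
  return(total)
}

add_one(1)   # the function works with its own local `total`
total        # the variable in our session still holds its original value
```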
+ +:::::::::::::::::::::::::::::::::::::::::::::::::: + +```{r, eval=FALSE} + gdp <- dat$pop * dat$gdpPercap + new <- cbind(dat, gdp=gdp) + return(new) +} +``` + +Finally, we calculated the GDP on our new subset, and created a new data frame +with that column added. This means when we call the function later we can see +the context for the returned GDP values, which is much better than in our first +attempt where we got a vector of numbers. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 4 + +Test out your GDP function by calculating the GDP for New Zealand in 1987. How +does this differ from New Zealand's GDP in 1952? + +::::::::::::::: solution + +## Solution to challenge 4 + +```{r, eval=FALSE} + calcGDP(gapminder, year = c(1952, 1987), country = "New Zealand") +``` + +GDP for New Zealand in 1987: 65050008703 + +GDP for New Zealand in 1952: 21058193787 + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 5 + +The `paste()` function can be used to combine text together, e.g: + +```{r} +best_practice <- c("Write", "programs", "for", "people", "not", "computers") +paste(best_practice, collapse=" ") +``` + +Write a function called `fence()` that takes two vectors as arguments, called +`text` and `wrapper`, and prints out the text wrapped with the `wrapper`: + +```{r, eval=FALSE} +fence(text=best_practice, wrapper="***") +``` + +_Note:_ the `paste()` function has an argument called `sep`, which specifies +the separator between text. The default is a space: " ". The default for +`paste0()` is no space "". 
+ +::::::::::::::: solution + +## Solution to challenge 5 + +Write a function called `fence()` that takes two vectors as arguments, +called `text` and `wrapper`, and prints out the text wrapped with the +`wrapper`: + +```{r} +fence <- function(text, wrapper){ + text <- c(wrapper, text, wrapper) + result <- paste(text, collapse = " ") + return(result) +} +best_practice <- c("Write", "programs", "for", "people", "not", "computers") +fence(text=best_practice, wrapper="***") +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip + +R has some unique aspects that can be exploited when performing more +complicated operations. We will not be writing anything that requires +knowledge of these more advanced concepts. In the future when you are +comfortable writing functions in R, you can learn more by reading the +[R Language Manual][man] or this [chapter] from +[Advanced R Programming][adv-r] by Hadley Wickham. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: Testing and documenting + +It's important to both test functions and document them: +documentation helps you, and others, understand what the +purpose of your function is and how to use it, and it's +important to make sure that your function actually does +what you think it does. + +When you first start out, your workflow will probably look a lot +like this: + +1. Write a function +2. Comment parts of the function to document its behaviour +3. Load in the source file +4. Experiment with it in the console to make sure it behaves + as you expect +5. Make any necessary bug fixes +6. Rinse and repeat. + +Formal documentation for functions, written in separate `.Rd` +files, gets turned into the documentation you see in help +files. 
The [roxygen2] package allows R coders to write documentation +alongside the function code and then process it into the appropriate `.Rd` +files. You will want to switch to this more formal method of writing +documentation when you start writing more complicated R projects. In fact, +packages are, in essence, bundles of functions with this formal documentation. +Loading your own functions through `source("functions.R")` is equivalent to +loading someone else's functions (or your own one day!) through +`library("package")`. + +Formal automated tests can be written using the [testthat] package. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +[man]: https://cran.r-project.org/doc/manuals/r-release/R-lang.html#Environment-objects +[chapter]: https://adv-r.had.co.nz/Environments.html +[adv-r]: https://adv-r.had.co.nz/ +[roxygen2]: https://cran.r-project.org/web/packages/roxygen2/vignettes/rd.html +[testthat]: https://r-pkgs.had.co.nz/tests.html + +:::::::::::::::::::::::::::::::::::::::: keypoints + +- Use `function` to define a new function in R. +- Use parameters to pass values into functions. +- Use `stopifnot()` to flexibly check function arguments in R. +- Load functions into programs using `source()`. + +:::::::::::::::::::::::::::::::::::::::::::::::::: diff --git a/locale/es/episodes/11-writing-data.Rmd b/locale/es/episodes/11-writing-data.Rmd new file mode 100644 index 000000000..646e11b7e --- /dev/null +++ b/locale/es/episodes/11-writing-data.Rmd @@ -0,0 +1,188 @@ +--- +title: Writing Data +teaching: 10 +exercises: 10 +source: Rmd +--- + +::::::::::::::::::::::::::::::::::::::: objectives + +- To be able to write out plots and data from R. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- How can I save plots and data created in R? 
+ +:::::::::::::::::::::::::::::::::::::::::::::::::: + +```{r, include=FALSE} +library("ggplot2") +gapminder <- read.csv("data/gapminder_data.csv", header = TRUE) +dir.create("cleaned-data") +``` + +## Saving plots + +You have already seen how to save the most recent plot you create in `ggplot2`, +using the command `ggsave`. As a refresher: + +```{r, eval=FALSE} +ggsave("My_most_recent_plot.pdf") +``` + +You can save a plot from within RStudio using the 'Export' button +in the 'Plot' window. This will give you the option of saving as a +.pdf or as .png, .jpg or other image formats. + +Sometimes you will want to save plots without creating them in the +'Plot' window first. Perhaps you want to make a pdf document with +multiple pages: each one a different plot, for example. Or perhaps +you're looping through multiple subsets of a file, plotting data from +each subset, and you want to save each plot, but obviously can't stop +the loop to click 'Export' for each one. + +In this case you can use a more flexible approach. The function +`pdf` creates a new pdf device. You can control the size and resolution +using the arguments to this function. + +```{r, eval=FALSE} +pdf("Life_Exp_vs_time.pdf", width=12, height=4) +ggplot(data=gapminder, aes(x=year, y=lifeExp, colour=country)) + + geom_line() + + theme(legend.position = "none") + +# You then have to make sure to turn off the pdf device! + +dev.off() +``` + +Open up this document and have a look. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 1 + +Rewrite your 'pdf' command to print a second +page in the pdf, showing a facet plot (hint: use `facet_grid`) +of the same data with one panel per continent. 
+ +::::::::::::::: solution + +## Solution to challenge 1 + +```{r, eval=FALSE} +pdf("Life_Exp_vs_time.pdf", width = 12, height = 4) +p <- ggplot(data = gapminder, aes(x = year, y = lifeExp, colour = country)) + + geom_line() + + theme(legend.position = "none") +p +p + facet_grid(~continent) +dev.off() +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +The commands `jpeg`, `png` etc. are used similarly to produce +documents in different formats. + +## Writing data + +At some point, you'll also want to write out data from R. + +We can use the `write.table` function for this, which is +very similar to `read.table` from before. + +Let's create a data-cleaning script. For this analysis, we +only want to focus on the gapminder data for Australia: + +```{r} +aust_subset <- gapminder[gapminder$country == "Australia",] + +write.table(aust_subset, + file="cleaned-data/gapminder-aus.csv", + sep="," +) +``` + +Let's switch back to the shell to take a look at the data to make sure it looks +OK: + +```{r, engine="bash"} +head cleaned-data/gapminder-aus.csv +``` + +Hmm, that's not quite what we wanted. Where did all these +quotation marks come from? Also the row numbers are +meaningless. + +Let's look at the help file to work out how to change this +behaviour. + +```{r, eval=FALSE} +?write.table +``` + +By default R will wrap character vectors with quotation marks +when writing out to file. It will also write out the row and +column names. + +Let's fix this: + +```{r} +write.table( + gapminder[gapminder$country == "Australia",], + file="cleaned-data/gapminder-aus.csv", + sep=",", quote=FALSE, row.names=FALSE +) +``` + +Now let's look at the data again using our shell skills: + +```{r, engine="bash"} +head cleaned-data/gapminder-aus.csv +``` + +That looks better! 
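
The same check can also be done from within R. The following is a minimal sketch that uses a made-up two-row data frame (`toy`) and a temporary file instead of the lesson's `cleaned-data/` directory, so these names and values are assumptions for illustration only:

```{r}
# Toy data standing in for the Australian subset (values are made up).
toy <- data.frame(country = c("A", "B"), year = c(2002, 2007))
out_file <- tempfile(fileext = ".csv")

write.table(toy, file = out_file, sep = ",", quote = FALSE, row.names = FALSE)

# Inspect the raw text: no quotation marks and no row-number column
readLines(out_file)

# Read the file back in to confirm the columns survive the round trip
check <- read.csv(out_file)
names(check)
```

Writing to a temporary file keeps the experiment out of the project directory.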
+ +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 2 + +Write a data-cleaning script file that subsets the gapminder +data to include only data points collected since 1990. + +Use this script to write out the new subset to a file +in the `cleaned-data/` directory. + +::::::::::::::: solution + +## Solution to challenge 2 + +```{r, eval=FALSE} +write.table( + gapminder[gapminder$year > 1990, ], + file = "cleaned-data/gapminder-after1990.csv", + sep = ",", quote = FALSE, row.names = FALSE +) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +```{r, echo=FALSE} +# We remove after rendering the lesson, because we don't want this in the lesson +# repository +unlink("cleaned-data", recursive=TRUE) +``` + +:::::::::::::::::::::::::::::::::::::::: keypoints + +- Save plots from RStudio using the 'Export' button. +- Use `write.table` to save tabular data. + +:::::::::::::::::::::::::::::::::::::::::::::::::: diff --git a/locale/es/episodes/12-dplyr.Rmd b/locale/es/episodes/12-dplyr.Rmd new file mode 100644 index 000000000..0f5540883 --- /dev/null +++ b/locale/es/episodes/12-dplyr.Rmd @@ -0,0 +1,487 @@ +--- +title: Data Frame Manipulation with dplyr +teaching: 40 +exercises: 15 +source: Rmd +--- + +::::::::::::::::::::::::::::::::::::::: objectives + +- To be able to use the six main data frame manipulation 'verbs' with pipes in `dplyr`. +- To understand how `group_by()` and `summarize()` can be combined to summarize datasets. +- Be able to analyze a subset of data using logical filtering. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- How can I manipulate data frames without repeating myself? 
+ +:::::::::::::::::::::::::::::::::::::::::::::::::: + +```{r, include=FALSE} +gapminder <- read.csv("data/gapminder_data.csv", header = TRUE) +``` + +Manipulation of data frames means many things to many researchers: we often +select certain observations (rows) or variables (columns), we often group the +data by a certain variable(s), or we even calculate summary statistics. We can +do these operations using the normal base R operations: + +```{r} +mean(gapminder$gdpPercap[gapminder$continent == "Africa"]) +mean(gapminder$gdpPercap[gapminder$continent == "Americas"]) +mean(gapminder$gdpPercap[gapminder$continent == "Asia"]) +``` + +But this isn't very _nice_ because there is a fair bit of repetition. Repeating +yourself will cost you time, both now and later, and potentially introduce some +nasty bugs. + +## The `dplyr` package + +Luckily, the [`dplyr`](https://cran.r-project.org/package=dplyr) +package provides a number of very useful functions for manipulating data frames +in a way that will reduce the above repetition, reduce the probability of making +errors, and probably even save you some typing. As an added bonus, you might +even find the `dplyr` grammar easier to read. + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: Tidyverse + +`dplyr` package belongs to a broader family of opinionated R packages +designed for data science called the "Tidyverse". These +packages are specifically designed to work harmoniously together. +Some of these packages will be covered along this course, but you can find more +complete information here: [https://www.tidyverse.org/](https://www.tidyverse.org/). + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +Here we're going to cover 5 of the most commonly used functions as well as using +pipes (`%>%`) to combine them. + +1. `select()` +2. `filter()` +3. `group_by()` +4. `summarize()` +5. 
`mutate()` + +If you have not installed this package earlier, please do so: + +```{r, eval=FALSE} +install.packages('dplyr') +``` + +Now let's load the package: + +```{r, message=FALSE} +library("dplyr") +``` + +## Using select() + +If, for example, we wanted to move forward with only a few of the variables in +our data frame we could use the `select()` function. This will keep only the +variables you select. + +```{r} +year_country_gdp <- select(gapminder, year, country, gdpPercap) +``` + +![](fig/13-dplyr-fig1.png){alt='Diagram illustrating use of select function to select two columns of a data frame'} + +We can also remove a single column from the `gapminder` data, for example +the `continent` column: + +```{r} +smaller_gapminder_data <- select(gapminder, -continent) +``` + +If we open up `year_country_gdp` we'll see that it only contains the year, +country and gdpPercap. Above we used 'normal' grammar, but the strengths of +`dplyr` lie in combining several functions using pipes. Since the pipes grammar +is unlike anything we've seen in R before, let's repeat what we've done above +using pipes. + +```{r} +year_country_gdp <- gapminder %>% select(year, country, gdpPercap) +``` + +To help you understand why we wrote that in that way, let's walk through it step +by step. First we summon the gapminder data frame and pass it on, using the pipe +symbol `%>%`, to the next step, which is the `select()` function. In this case +we don't specify which data object we use in the `select()` function since it +gets that from the previous pipe. **Fun Fact**: There is a good chance you have +encountered pipes before in the shell. In R, a pipe symbol is `%>%` while in the +shell it is `|` but the concept is the same! + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: Renaming data frame columns in dplyr + +In Chapter 4 we covered how you can rename columns with base R by assigning a value to the output of the `names()` function. 
+Just like select, this is a bit cumbersome, but thankfully dplyr has a `rename()` function. + +Within a pipeline, the syntax is `rename(new_name = old_name)`. +For example, we may want to rename the gdpPercap column name from our `select()` statement above. + +```{r} +tidy_gdp <- year_country_gdp %>% rename(gdp_per_capita = gdpPercap) + +head(tidy_gdp) +``` + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Using filter() + +If we now want to move forward with the above, but only with European +countries, we can combine `select` and `filter` + +```{r} +year_country_gdp_euro <- gapminder %>% + filter(continent == "Europe") %>% + select(year, country, gdpPercap) +``` + +If we now want to show life expectancy of European countries but only +for a specific year (e.g., 2007), we can do as below. + +```{r} +europe_lifeExp_2007 <- gapminder %>% + filter(continent == "Europe", year == 2007) %>% + select(country, lifeExp) +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 1 + +Write a single command (which can span multiple lines and includes pipes) that +will produce a data frame that has the African values for `lifeExp`, `country` +and `year`, but not for other Continents. How many rows does your data frame +have and why? + +::::::::::::::: solution + +## Solution to Challenge 1 + +```{r} +year_country_lifeExp_Africa <- gapminder %>% + filter(continent == "Africa") %>% + select(year, country, lifeExp) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +As with last time, first we pass the gapminder data frame to the `filter()` +function, then we pass the filtered version of the gapminder data frame to the +`select()` function. **Note:** The order of operations is very important in this +case. If we used 'select' first, filter would not be able to find the variable +continent since we would have removed it in the previous step. 
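
The effect of the ordering can be tried out on a small example. This sketch uses a made-up three-row data frame (`toy`) in place of gapminder, and wraps the failing pipeline in `tryCatch()` so that the chunk still runs; the placeholder message is our own, not dplyr's actual error text:

```{r}
library("dplyr")

# Toy data frame (made-up values), standing in for gapminder.
toy <- data.frame(
  continent = c("Europe", "Africa", "Europe"),
  country   = c("A", "B", "C"),
  lifeExp   = c(70, 60, 75)
)

# filter() first, then select(): works as intended
ok <- toy %>%
  filter(continent == "Europe") %>%
  select(country, lifeExp)

# select() first drops `continent`, so filter() has nothing to test against
bad <- tryCatch(
  toy %>%
    select(country, lifeExp) %>%
    filter(continent == "Europe"),
  error = function(e) "error: continent not found"
)
```

Note that the second pipeline only fails if no other object called `continent` exists in the session.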
+ +## Using group\_by() + +Now, we were supposed to be reducing the error-prone repetitiveness of what can +be done with base R, but up to now we haven't done that since we would have to +repeat the above for each continent. Instead of `filter()`, which will only pass +observations that meet your criteria (in the above: `continent=="Europe"`), we +can use `group_by()`, which will essentially use every unique criterion that you +could have used in filter. + +```{r} +str(gapminder) + +str(gapminder %>% group_by(continent)) +``` + +You will notice that the structure of the data frame where we used `group_by()` +(`grouped_df`) is not the same as the original `gapminder` (`data.frame`). A +`grouped_df` can be thought of as a `list` where each item in the `list` is a +`data.frame` which contains only the rows that correspond to a particular +value of `continent` (at least in the example above). + +![](fig/13-dplyr-fig2.png){alt='Diagram illustrating how the group by function organizes a data frame into groups'} + +## Using summarize() + +The above was a bit on the uneventful side but `group_by()` is much more +exciting in conjunction with `summarize()`. This will allow us to create new +variable(s) by using functions that repeat for each of the continent-specific +data frames. That is to say, using the `group_by()` function, we split our +original data frame into multiple pieces, then we can run functions +(e.g. `mean()` or `sd()`) within `summarize()`. + +```{r} +gdp_bycontinents <- gapminder %>% + group_by(continent) %>% + summarize(mean_gdpPercap = mean(gdpPercap)) +``` + +![](fig/13-dplyr-fig3.png){alt='Diagram illustrating the use of group by and summarize together to create a new variable'} + +```{r, eval=FALSE} +continent mean_gdpPercap + +1 Africa 2193.755 +2 Americas 7136.110 +3 Asia 7902.150 +4 Europe 14469.476 +5 Oceania 18621.609 +``` + +That allowed us to calculate the mean gdpPercap for each continent, but it gets +even better. 
+ +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 2 + +Calculate the average life expectancy per country. Which has the longest average life +expectancy and which has the shortest average life expectancy? + +::::::::::::::: solution + +## Solution to Challenge 2 + +```{r} +lifeExp_bycountry <- gapminder %>% + group_by(country) %>% + summarize(mean_lifeExp = mean(lifeExp)) +lifeExp_bycountry %>% + filter(mean_lifeExp == min(mean_lifeExp) | mean_lifeExp == max(mean_lifeExp)) +``` + +Another way to do this is to use the `dplyr` function `arrange()`, which +arranges the rows in a data frame according to the order of one or more +variables from the data frame. It has similar syntax to other functions from +the `dplyr` package. You can use `desc()` inside `arrange()` to sort in +descending order. + +```{r} +lifeExp_bycountry %>% + arrange(mean_lifeExp) %>% + head(1) +lifeExp_bycountry %>% + arrange(desc(mean_lifeExp)) %>% + head(1) +``` + +Alphabetical order works too + +```{r} +lifeExp_bycountry %>% + arrange(desc(country)) %>% + head(1) +``` + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::: + +The function `group_by()` allows us to group by multiple variables. Let's group by `year` and `continent`. + +```{r} +gdp_bycontinents_byyear <- gapminder %>% + group_by(continent, year) %>% + summarize(mean_gdpPercap = mean(gdpPercap)) +``` + +That is already quite powerful, but it gets even better! You're not limited to defining 1 new variable in `summarize()`. + +```{r} +gdp_pop_bycontinents_byyear <- gapminder %>% + group_by(continent, year) %>% + summarize(mean_gdpPercap = mean(gdpPercap), + sd_gdpPercap = sd(gdpPercap), + mean_pop = mean(pop), + sd_pop = sd(pop)) +``` + +## count() and n() + +A very common operation is to count the number of observations for each +group. The `dplyr` package comes with two related functions that help with this. 
+ +For instance, if we wanted to check the number of countries included in the +dataset for the year 2002, we can use the `count()` function. It takes the name +of one or more columns that contain the groups we are interested in, and we can +optionally sort the results in descending order by adding `sort=TRUE`: + +```{r} +gapminder %>% + filter(year == 2002) %>% + count(continent, sort = TRUE) +``` + +If we need to use the number of observations in calculations, the `n()` function +is useful. It will return the total number of observations in the current group rather than counting the number of observations in each group within a specific column. For instance, if we wanted to get the standard error of life expectancy per continent: + +```{r} +gapminder %>% + group_by(continent) %>% + summarize(se_le = sd(lifeExp)/sqrt(n())) +``` + +You can also chain together several summary operations; in this case calculating the `minimum`, `maximum`, `mean` and `se` of each continent's per-country life expectancy: + +```{r} +gapminder %>% + group_by(continent) %>% + summarize( + mean_le = mean(lifeExp), + min_le = min(lifeExp), + max_le = max(lifeExp), + se_le = sd(lifeExp)/sqrt(n())) +``` + +## Using mutate() + +We can also create new variables prior to (or even after) summarizing information using `mutate()`. + +```{r} +gdp_pop_bycontinents_byyear <- gapminder %>% + mutate(gdp_billion = gdpPercap*pop/10^9) %>% + group_by(continent,year) %>% + summarize(mean_gdpPercap = mean(gdpPercap), + sd_gdpPercap = sd(gdpPercap), + mean_pop = mean(pop), + sd_pop = sd(pop), + mean_gdp_billion = mean(gdp_billion), + sd_gdp_billion = sd(gdp_billion)) +``` + +## Connect mutate with logical filtering: ifelse + +When creating new variables, we can tie this to a logical condition. A simple combination of +`mutate()` and `ifelse()` facilitates filtering right where it is needed: in the moment of creating something new.
+This easy-to-read statement is a fast and powerful way of discarding certain data (even though the overall dimension +of the data frame will not change) or for updating values depending on this given condition. + +```{r} +## keeping all data but "filtering" after a certain condition +# calculate GDP only for observations with a life expectancy above 25 +gdp_pop_bycontinents_byyear_above25 <- gapminder %>% + mutate(gdp_billion = ifelse(lifeExp > 25, gdpPercap * pop / 10^9, NA)) %>% + group_by(continent, year) %>% + summarize(mean_gdpPercap = mean(gdpPercap), + sd_gdpPercap = sd(gdpPercap), + mean_pop = mean(pop), + sd_pop = sd(pop), + mean_gdp_billion = mean(gdp_billion), + sd_gdp_billion = sd(gdp_billion)) + +## updating only if a certain condition is fulfilled +# for life expectancies above 40 years, the GDP to be expected in the future is scaled +gdp_future_bycontinents_byyear_high_lifeExp <- gapminder %>% + mutate(gdp_futureExpectation = ifelse(lifeExp > 40, gdpPercap * 1.5, gdpPercap)) %>% + group_by(continent, year) %>% + summarize(mean_gdpPercap = mean(gdpPercap), + mean_gdpPercap_expected = mean(gdp_futureExpectation)) +``` + +## Combining `dplyr` and `ggplot2` + +First install and load ggplot2: + +```{r, eval=FALSE} +install.packages('ggplot2') +``` + +```{r, message=FALSE} +library("ggplot2") +``` + +In the plotting lesson we looked at how to make a multi-panel figure by adding +a layer of facet panels using `ggplot2`. Here is the code we used (with some +extra comments): + +```{r} +# Filter countries located in the Americas +americas <- gapminder[gapminder$continent == "Americas", ] +# Make the plot +ggplot(data = americas, mapping = aes(x = year, y = lifeExp)) + + geom_line() + + facet_wrap( ~ country) + + theme(axis.text.x = element_text(angle = 45)) +``` + +This code makes the right plot but it also creates an intermediate variable +(`americas`) that we might not have any other uses for.
Just as we used +`%>%` to pipe data along a chain of `dplyr` functions we can use it to pass data +to `ggplot()`. Because `%>%` replaces the first argument in a function we don't +need to specify the `data =` argument in the `ggplot()` function. By combining +`dplyr` and `ggplot2` functions we can make the same figure without creating any +new variables or modifying the data. + +```{r} +gapminder %>% + # Filter countries located in the Americas + filter(continent == "Americas") %>% + # Make the plot + ggplot(mapping = aes(x = year, y = lifeExp)) + + geom_line() + + facet_wrap( ~ country) + + theme(axis.text.x = element_text(angle = 45)) +``` + +More examples of using the function `mutate()` and the `ggplot2` package. + +```{r} +gapminder %>% + # extract first letter of country name into new column + mutate(startsWith = substr(country, 1, 1)) %>% + # only keep countries starting with A or Z + filter(startsWith %in% c("A", "Z")) %>% + # plot lifeExp into facets + ggplot(aes(x = year, y = lifeExp, colour = continent)) + + geom_line() + + facet_wrap(vars(country)) + + theme_minimal() +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Advanced Challenge + +Calculate the average life expectancy in 2002 of 2 randomly selected countries +for each continent. Then arrange the continent names in reverse order. +**Hint:** Use the `dplyr` functions `arrange()` and `sample_n()`, they have +similar syntax to other dplyr functions. 
+ +::::::::::::::: solution + +## Solution to Advanced Challenge + +```{r} +lifeExp_2countries_bycontinents <- gapminder %>% + filter(year == 2002) %>% + group_by(continent) %>% + sample_n(2) %>% + summarize(mean_lifeExp = mean(lifeExp)) %>% + arrange(desc(continent)) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Other great resources + +- [R for Data Science](https://r4ds.hadley.nz/) (online book) +- [Data Wrangling Cheat sheet](https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf) (pdf file) +- [Introduction to dplyr](https://dplyr.tidyverse.org/) (online documentation) +- [Data wrangling with R and RStudio](https://www.rstudio.com/resources/webinars/data-wrangling-with-r-and-rstudio/) (online video) +- [Tidyverse Skills for Data Science](https://jhudatascience.org/tidyversecourse/) (online book) + +:::::::::::::::::::::::::::::::::::::::: keypoints + +- Use the `dplyr` package to manipulate data frames. +- Use `select()` to choose variables from a data frame. +- Use `filter()` to choose data based on values. +- Use `group_by()` and `summarize()` to work with subsets of data. +- Use `mutate()` to create new variables. + +:::::::::::::::::::::::::::::::::::::::::::::::::: diff --git a/locale/es/episodes/13-tidyr.Rmd b/locale/es/episodes/13-tidyr.Rmd new file mode 100644 index 000000000..96e59d18d --- /dev/null +++ b/locale/es/episodes/13-tidyr.Rmd @@ -0,0 +1,321 @@ +--- +title: Data Frame Manipulation with tidyr +teaching: 30 +exercises: 15 +source: Rmd +--- + +::::::::::::::::::::::::::::::::::::::: objectives + +- To understand the concepts of 'longer' and 'wider' data frame formats and be able to convert between them with `tidyr`. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- How can I change the layout of a data frame?
+ +:::::::::::::::::::::::::::::::::::::::::::::::::: + +```{r, include=FALSE} +gapminder <- read.csv("data/gapminder_data.csv", header = TRUE, stringsAsFactors = FALSE) +gap_wide <- read.csv("data/gapminder_wide.csv", header = TRUE, stringsAsFactors = FALSE) +``` + +Researchers often want to reshape their data frames from 'wide' to 'longer' +layouts, or vice-versa. The 'long' layout or format is where: + +- each column is a variable +- each row is an observation + +In the purely 'long' (or 'longest') format, you usually have 1 column for the observed variable and the other columns are ID variables. + +For the 'wide' format each row is often a site/subject/patient and you have +multiple observation variables containing the same type of data. These can be +either repeated observations over time, or observation of multiple variables (or +a mix of both). You may find data input may be simpler or some other +applications may prefer the 'wide' format. However, many of `R`'s functions have +been designed assuming you have 'longer' formatted data. This tutorial will help you +efficiently transform your data shape regardless of original format. + +![](fig/14-tidyr-fig1.png){alt='Diagram illustrating the difference between a wide versus long layout of a data frame'} + +Long and wide data frame layouts mainly affect readability. For humans, the wide format is often more intuitive since we can often see more of the data on the screen due +to its shape. However, the long format is more machine readable and is closer +to the formatting of databases. The ID variables in our data frames are similar to +the fields in a database and observed variables are like the database values. 
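To make the two layouts concrete, here is a minimal sketch that builds the same made-up population figures twice, once wide and once long. The country names `A` and `B` and all values are hypothetical and are not taken from gapminder.

```r
# Wide layout: one row per country, one column per year (made-up values)
wide <- data.frame(
  country  = c("A", "B"),
  pop_2002 = c(10, 20),
  pop_2007 = c(12, 23)
)

# Long layout: ID columns (country, year) plus one column of observed values
long <- data.frame(
  country = c("A", "A", "B", "B"),
  year    = c(2002, 2007, 2002, 2007),
  pop     = c(10, 12, 20, 23)
)

dim(wide)
dim(long)
```

Both objects hold the same four observations; only the arrangement of rows and columns differs.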
+ +## Getting started + +First install the packages if you haven't already done so (you probably +installed dplyr in the previous lesson): + +```{r, eval=FALSE} +#install.packages("tidyr") +#install.packages("dplyr") +``` + +Load the packages: + +```{r, message=FALSE} +library("tidyr") +library("dplyr") +``` + +First, let's look at the structure of our original gapminder data frame: + +```{r} +str(gapminder) +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 1 + +Is gapminder a purely long, purely wide, or some intermediate format? + +::::::::::::::: solution + +## Solution to Challenge 1 + +The original gapminder data.frame is in an intermediate format. It is not +purely long since it has multiple observation variables +(`pop`,`lifeExp`,`gdpPercap`). + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +Sometimes, as with the gapminder dataset, we have multiple types of observed +data. It is somewhere in between the purely 'long' and 'wide' data formats. We +have 3 "ID variables" (`continent`, `country`, `year`) and 3 "Observation +variables" (`pop`,`lifeExp`,`gdpPercap`). This intermediate format can be +preferred despite not having ALL observations in 1 column given that all 3 +observation variables have different units. There are few operations that would +need us to make this data frame any longer (i.e. 4 ID variables and 1 +Observation variable). + +While using many of the functions in R, which are often vector based, you +usually do not want to do mathematical operations on values with different +units. For example, using the purely long format, a single mean for all of the +values of population, life expectancy, and GDP would not be meaningful since it +would return the mean of values with 3 incompatible units. The solution is that +we first manipulate the data either by grouping (see the lesson on `dplyr`), or +we change the structure of the data frame.
**Note:** Some plotting functions in +R actually work better in the wide format data. + +## From wide to long format with pivot\_longer() + +Until now, we've been using the nicely formatted original gapminder dataset, but +'real' data (i.e. our own research data) will never be so well organized. Here +let's start with the wide formatted version of the gapminder dataset. + +> Download the wide version of the gapminder data from [this link to a csv file](data/gapminder_wide.csv) +> and save it in your data folder. + +We'll load the data file and look at it. Note: we don't want our continent and +country columns to be factors, so we use the stringsAsFactors argument for +`read.csv()` to disable that. + +```{r} +gap_wide <- read.csv("data/gapminder_wide.csv", stringsAsFactors = FALSE) +str(gap_wide) +``` + +![](fig/14-tidyr-fig2.png){alt='Diagram illustrating the wide format of the gapminder data frame'} + +To change this very wide data frame layout back to our nice, intermediate (or longer) layout, we will use one of the two available `pivot` functions from the `tidyr` package. To convert from wide to a longer format, we will use the `pivot_longer()` function. `pivot_longer()` makes datasets longer by increasing the number of rows and decreasing the number of columns, or 'lengthening' your observation variables into a single variable. + +![](fig/14-tidyr-fig3.png){alt='Diagram illustrating how pivot longer reorganizes a data frame from a wide to long format'} + +```{r} +gap_long <- gap_wide %>% + pivot_longer( + cols = c(starts_with('pop'), starts_with('lifeExp'), starts_with('gdpPercap')), + names_to = "obstype_year", values_to = "obs_values" + ) +str(gap_long) +``` + +Here we have used piping syntax which is similar to what we were doing in the +previous lesson with dplyr. In fact, these are compatible and you can use a mix +of tidyr and dplyr functions by piping them together. 
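As a small illustration of mixing the two packages in one pipeline, the sketch below pivots a made-up wide data frame and then summarizes it with `dplyr` verbs. The `wide` object and its values are hypothetical stand-ins for `gap_wide`.

```r
library(tidyr)
library(dplyr)

# Made-up wide data: one population column per year
wide <- data.frame(
  country  = c("A", "B"),
  pop_2002 = c(10, 20),
  pop_2007 = c(12, 23)
)

# tidyr reshapes, then dplyr groups and summarizes, all in a single pipe
pop_means <- wide %>%
  pivot_longer(cols = starts_with("pop"),
               names_to = "obstype_year", values_to = "obs_values") %>%
  group_by(country) %>%
  summarize(mean_pop = mean(obs_values))

pop_means
```

Because every step takes a data frame as its first argument and returns one, the verbs from either package can be chained in any order that makes sense for the data.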
+ +We first provide to `pivot_longer()` a vector of column names that will be +pivoted into longer format. We could type out all the observation variables, but +as in the `select()` function (see `dplyr` lesson), we can use the `starts_with()` +helper to select all variables that start with the desired character string. +`pivot_longer()` also allows the alternative syntax of using the `-` symbol to +identify which variables are not to be pivoted (i.e. ID variables). + +The next arguments to `pivot_longer()` are `names_to` for naming the column that +will contain the new ID variable (`obstype_year`) and `values_to` for naming the +new amalgamated observation variable (`obs_values`). We supply these new column +names as strings. + +![](fig/14-tidyr-fig4.png){alt='Diagram illustrating the long format of the gapminder data'} + +```{r} +gap_long <- gap_wide %>% + pivot_longer( + cols = c(-continent, -country), + names_to = "obstype_year", values_to = "obs_values" + ) +str(gap_long) +``` + +That may seem trivial with this particular data frame, but sometimes you have 1 +ID variable and 40 observation variables with irregular variable names. The +flexibility is a huge time saver! + +Now `obstype_year` actually contains 2 pieces of information, the observation +type (`pop`,`lifeExp`, or `gdpPercap`) and the `year`. We can use the +`separate()` function to split the character strings into multiple variables: + +```{r} +gap_long <- gap_long %>% separate(obstype_year, into = c('obs_type', 'year'), sep = "_") +gap_long$year <- as.integer(gap_long$year) +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 2 + +Using `gap_long`, calculate the mean life expectancy, population, and gdpPercap for each continent.
+**Hint:** use the `group_by()` and `summarize()` functions we learned in the `dplyr` lesson + +::::::::::::::: solution + +## Solution to Challenge 2 + +```{r} +gap_long %>% group_by(continent, obs_type) %>% + summarize(means=mean(obs_values)) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## From long to intermediate format with pivot\_wider() + +It is always good to check work. So, let's use the second `pivot` function, `pivot_wider()`, to 'widen' our observation variables back out. `pivot_wider()` is the opposite of `pivot_longer()`, making a dataset wider by increasing the number of columns and decreasing the number of rows. We can use `pivot_wider()` to pivot or reshape our `gap_long` to the original intermediate format or the widest format. Let's start with the intermediate format. + +The `pivot_wider()` function takes `names_from` and `values_from` arguments. + +To `names_from` we supply the column name whose contents will be pivoted into new +output columns in the widened data frame. The corresponding values will be added +from the column named in the `values_from` argument. + +```{r} +gap_normal <- gap_long %>% + pivot_wider(names_from = obs_type, values_from = obs_values) +dim(gap_normal) +dim(gapminder) +names(gap_normal) +names(gapminder) +``` + +Now we've got an intermediate data frame `gap_normal` with the same dimensions as +the original `gapminder`, but the order of the variables is different. Let's fix +that before checking if they are `all.equal()`. + +```{r} +gap_normal <- gap_normal[, names(gapminder)] +all.equal(gap_normal, gapminder) +head(gap_normal) +head(gapminder) +``` + +We're almost there, the original was sorted by `country`, then +`year`. + +```{r} +gap_normal <- gap_normal %>% arrange(country, year) +all.equal(gap_normal, gapminder) +``` + +That's great! We've gone from the longest format back to the intermediate and we +didn't introduce any errors in our code. 
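The same round-trip check can be sketched on a tiny made-up data frame, which makes it easier to see each step. The `original` object below is hypothetical; converting back with `as.data.frame()` is an assumption made here so that a plain data frame is compared with a plain data frame.

```r
library(tidyr)
library(dplyr)

# Made-up intermediate-format data: 2 ID columns, 2 observation columns
original <- data.frame(
  country = c("A", "A", "B", "B"),
  year    = c(2002L, 2007L, 2002L, 2007L),
  pop     = c(10, 12, 20, 23),
  lifeExp = c(50, 52, 70, 71)
)

# Lengthen: both observation columns are stacked into obs_values
long <- original %>%
  pivot_longer(cols = c(pop, lifeExp),
               names_to = "obs_type", values_to = "obs_values")

# Widen again, restore the row order, and drop the tibble class
round_trip <- long %>%
  pivot_wider(names_from = obs_type, values_from = obs_values) %>%
  arrange(country, year) %>%
  as.data.frame()

all.equal(round_trip, original)
```

If the comparison reports differences, column order and row order are the first things to check, as in the gapminder example above.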
+ +Now let's convert the long all the way back to the wide. In the wide format, we +will keep country and continent as ID variables and pivot the observations +across the 3 metrics (`pop`,`lifeExp`,`gdpPercap`) and time (`year`). First we +need to create appropriate labels for all our new variables (time\*metric +combinations) and we also need to unify our ID variables to simplify the process +of defining `gap_wide`. + +```{r} +gap_temp <- gap_long %>% unite(var_ID, continent, country, sep = "_") +str(gap_temp) + +gap_temp <- gap_long %>% + unite(ID_var, continent, country, sep = "_") %>% + unite(var_names, obs_type, year, sep = "_") +str(gap_temp) +``` + +Using `unite()` we now have a single ID variable which is a combination of +`continent`,`country`,and we have defined variable names. We're now ready to +pipe in `pivot_wider()` + +```{r} +gap_wide_new <- gap_long %>% + unite(ID_var, continent, country, sep = "_") %>% + unite(var_names, obs_type, year, sep = "_") %>% + pivot_wider(names_from = var_names, values_from = obs_values) +str(gap_wide_new) +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 3 + +Take this 1 step further and create a `gap_ludicrously_wide` format data by pivoting over countries, year and the 3 metrics? +**Hint** this new data frame should only have 5 rows. 
+ +::::::::::::::: solution + +## Solution to Challenge 3 + +```{r} +gap_ludicrously_wide <- gap_long %>% + unite(var_names, obs_type, year, country, sep = "_") %>% + pivot_wider(names_from = var_names, values_from = obs_values) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +Now we have a great 'wide' format data frame, but the `ID_var` could be more +usable, let's separate it into 2 variables with `separate()` + +```{r} +gap_wide_betterID <- separate(gap_wide_new, ID_var, c("continent", "country"), sep="_") +gap_wide_betterID <- gap_long %>% + unite(ID_var, continent, country, sep = "_") %>% + unite(var_names, obs_type, year, sep = "_") %>% + pivot_wider(names_from = var_names, values_from = obs_values) %>% + separate(ID_var, c("continent","country"), sep = "_") +str(gap_wide_betterID) + +all.equal(gap_wide, gap_wide_betterID) +``` + +There and back again! + +## Other great resources + +- [R for Data Science](https://r4ds.hadley.nz/) (online book) +- [Data Wrangling Cheat sheet](https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf) (pdf file) +- [Introduction to tidyr](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html) (online documentation) +- [Data wrangling with R and RStudio](https://www.rstudio.com/resources/webinars/data-wrangling-with-r-and-rstudio/) (online video) + +:::::::::::::::::::::::::::::::::::::::: keypoints + +- Use the `tidyr` package to change the layout of data frames. +- Use `pivot_longer()` to go from wide to longer layout. +- Use `pivot_wider()` to go from long to wider layout. 
+ +:::::::::::::::::::::::::::::::::::::::::::::::::: diff --git a/locale/es/episodes/14-knitr-markdown.Rmd b/locale/es/episodes/14-knitr-markdown.Rmd new file mode 100644 index 000000000..5829180aa --- /dev/null +++ b/locale/es/episodes/14-knitr-markdown.Rmd @@ -0,0 +1,493 @@ +--- +title: Producing Reports With knitr +teaching: 60 +exercises: 15 +source: Rmd +--- + +::::::::::::::::::::::::::::::::::::::: objectives + +- Understand the value of writing reproducible reports +- Learn how to recognise and compile the basic components of an R Markdown file +- Become familiar with R code chunks, and understand their purpose, structure and options +- Demonstrate the use of inline chunks for weaving R outputs into text blocks, for example when discussing the results of some calculations +- Be aware of alternative output formats to which an R Markdown file can be exported + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- How can I integrate software and reports? + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +```{r chunk_options, include=FALSE} +``` + +## Data analysis reports + +Data analysts tend to write a lot of reports, describing their +analyses and results, for their collaborators or to document their +work for future reference. + +Many new users begin by first writing a single R script containing all of their +work, and then share the analysis by emailing the script and various graphs +as attachments. But this can be cumbersome, requiring a lengthy discussion to +explain which attachment was which result. + +Writing formal reports with Word or [LaTeX](https://www.latex-project.org/) +can simplify this process by incorporating both the analysis report and output graphs +into a single document. 
But tweaking formatting to make figures look correct +and fixing obnoxious page breaks can be tedious and lead to a lengthy "whack-a-mole" +game of fixing new mistakes resulting from a single formatting change. + +Creating a report as a web page (which is an html file) using R Markdown makes things easier. +The report can be one long stream, so tall figures that wouldn't ordinarily fit on +one page can be kept at full size and are easier to read, since the reader can simply +keep scrolling. Additionally, the formatting of an R Markdown document is simple and easy to modify, allowing you to spend +more time on your analyses instead of writing reports. + +## Literate programming + +Ideally, such analysis reports are _reproducible_ documents: If an +error is discovered, or if some additional subjects are added to the +data, you can just re-compile the report and get the new or corrected +results rather than having to reconstruct figures, paste them into +a Word document, and hand-edit various detailed results. + +The key R package here is [`knitr`](https://yihui.name/knitr/). It allows you +to create a document that is a mixture of text and chunks of +code. When the document is processed by `knitr`, chunks of code will +be executed, and graphs or other results will be inserted into the final document. + +This sort of idea has been called "literate programming". + +`knitr` allows you to mix basically any type of text with code from different programming languages, but we recommend that you use `R Markdown`, which mixes Markdown +with R. [Markdown](https://www.markdownguide.org/) is a light-weight mark-up language for creating web +pages. + +## Creating an R Markdown file + +Within RStudio, click File → New File → R Markdown and +you'll get a dialog box like this: + +![](fig/New_R_Markdown.png){alt='Screenshot of the New R Markdown file dialogue box in RStudio'} + +You can stick with the default (HTML output), but give it a title.
+ +## Basic components of R Markdown + +The initial chunk of text (header) contains instructions for R to specify what kind of document will be created, and the options chosen. You can use the header to give your document a title, author, date, and tell it what type of output you want +to produce. In this case, we're creating an html document. + +``` +--- +title: "Initial R Markdown document" +author: "Karl Broman" +date: "April 23, 2015" +output: html_document +--- +``` + +You can delete any of those fields if you don't want them +included. The double-quotes aren't strictly _necessary_ in this case. +They're mostly needed if you want to include a colon in the title. + +RStudio creates the document with some example text to get you +started. Note below that there are chunks like + +
+```{r}
+summary(cars)
+```
+
+ +These are chunks of R code that will be executed by `knitr` and replaced +by their results. More on this later. + +## Markdown + +Markdown is a system for writing web pages by marking up the text much +as you would in an email rather than writing html code. The marked-up +text gets _converted_ to html, replacing the marks with the proper +html code. + +For now, let's delete all of the stuff that's there and write a bit of +markdown. + +You make things **bold** using two asterisks, like this: `**bold**`, +and you make things _italics_ by using underscores, like this: +`_italics_`. + +You can make a bulleted list by writing a list with hyphens or +asterisks with a space between the list and other text, like this: + +``` +A list: + +* bold with double-asterisks +* italics with underscores +* code-type font with backticks +``` + +or like this: + +``` +A second list: + +- bold with double-asterisks +- italics with underscores +- code-type font with backticks +``` + +Each will appear as: + +- bold with double-asterisks +- italics with underscores +- code-type font with backticks + +You can use whatever method you prefer, but _be consistent_. This maintains the +readability of your code. + +You can make a numbered list by just using numbers. You can even use the +same number over and over if you want: + +``` +1. bold with double-asterisks +1. italics with underscores +1. code-type font with backticks +``` + +This will appear as: + +1. bold with double-asterisks +2. italics with underscores +3. code-type font with backticks + +You can make section headers of different sizes by initiating a line +with some number of `#` symbols: + +``` +# Title +## Main section +### Sub-section +#### Sub-sub section +``` + +You _compile_ the R Markdown document to an html webpage by clicking +the "Knit" button in the upper-left. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 1 + +Create a new R Markdown document. 
Delete all of the R code chunks +and write a bit of Markdown (some sections, some italicized +text, and an itemized list). + +Convert the document to a webpage. + +::::::::::::::: solution + +## Solution to Challenge 1 + +In RStudio, select File > New file > R Markdown... + +Delete the placeholder text and add the following: + +``` +# Introduction + +## Background on Data + +This report uses the *gapminder* dataset, which has columns that include: + +* country +* continent +* year +* lifeExp +* pop +* gdpPercap + +## Background on Methods + +``` + +Then click the 'Knit' button on the toolbar to generate an html document (webpage). + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## A bit more Markdown + +You can make a hyperlink like this: +`[Carpentries Home Page](https://carpentries.org/)`. + +You can include an image file like this: `![The Carpentries Logo](https://carpentries.org/assets/img/TheCarpentries.svg)` + +You can do subscripts (e.g., F~2~) with `F~2~` and superscripts (e.g., +F^2^) with `F^2^`. + +If you know how to write equations in +[LaTeX](https://www.latex-project.org/), you can use `$ $` and `$$ $$` to insert math equations, like +`$E = mc^2$` and + +``` +$$y = \mu + \sum_{i=1}^p \beta_i x_i + \epsilon$$ +``` + +You can review Markdown syntax by navigating to the +"Markdown Quick Reference" under the "Help" field in the +toolbar at the top of RStudio. + +## R code chunks + +The real power of Markdown comes from +mixing markdown with chunks of code. This is R Markdown. When +processed, the R code will be executed; if they produce figures, the +figures will be inserted in the final document. + +The main code chunks look like this: + +
+```{r load_data}
+gapminder <- read.csv("gapminder.csv")
+```
+
+ +That is, you place a chunk of R code between \`\`\`{r chunk\_name} +and \`\`\`. You should give each chunk +a unique name, as they will help you to fix errors and, if any graphs are +produced, the file names are based on the name of the code chunk that +produced them. You can create code chunks quickly in RStudio using the shortcuts Ctrl\+Alt\+I on Windows and Linux, or Cmd\+Option\+I on Mac. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 2 + +Add code chunks to: + +- Load the ggplot2 package +- Read the gapminder data +- Create a plot + +::::::::::::::: solution + +## Solution to Challenge 2 + +
+```{r load-ggplot2}
+library("ggplot2")
+```
+
+ +
+```{r read-gapminder-data}
+gapminder <- read.csv("gapminder.csv")
+```
+
+ +
+```{r make-plot}
+plot(lifeExp ~ year, data = gapminder)
+```
+
+ +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## How things get compiled + +When you press the "Knit" button, the R Markdown document is +processed by [`knitr`](https://yihui.name/knitr) and a plain Markdown +document is produced (as well as, potentially, a set of figure files): the R code is executed +and replaced by both the input and the output; if figures are +produced, links to those figures are included. + +The Markdown and figure documents are then processed by the tool +[`pandoc`](https://pandoc.org/), which converts the Markdown file into an +html file, with the figures embedded. + +```{r rmd_to_html_fig, fig.width=8, fig.height=3, fig.align="left", echo=FALSE} +par(mar=rep(0, 4), bty="n", cex=1.5) +plot(0, 0, type="n", xlab="", ylab="", xaxt="n", yaxt="n", + xlim=c(0, 100), ylim=c(0, 100)) +xw <- 10 +yh <- 35 +xm <- 12 +ym <- 50 +rect(xm-xw/2, ym-yh/2, xm+xw/2, ym+yh/2, lwd=2) +text(xm, ym, ".Rmd") + +xm <- 50 +ym <- 80 +rect(xm-xw/2, ym-yh/2, xm+xw/2, ym+yh/2, lwd=2) +text(xm, ym, ".md") +xm <- 50; ym <- 25 +for(i in c(2, 0, -2)) + rect(xm-xw/2+i, ym-yh/2+i, xm+xw/2+i, ym+yh/2+i, lwd=2, + border="black", col="white") +text(xm-2, ym-2, "figs/") + +xm <- 100-12 +ym <- 50 +rect(xm-xw/2, ym-yh/2, xm+xw/2, ym+yh/2, lwd=2) +text(xm, ym, ".html") + +arrows(22, 50, 38, 50, lwd=2, col="slateblue", len=0.1) +text((22+38)/2, 60, "knitr", col="darkslateblue", cex=1.3) + +arrows(62, 50, 78, 50, lwd=2, col="slateblue", len=0.1) +text((62+78)/2, 60, "pandoc", col="darkslateblue", cex=1.3) +``` + +## Chunk options + +There are a variety of options to affect how the code chunks are +treated. Here are some examples: + +- Use `echo=FALSE` to avoid having the code itself shown. +- Use `results="hide"` to avoid having any results printed. +- Use `eval=FALSE` to have the code shown but not evaluated. +- Use `warning=FALSE` and `message=FALSE` to hide any warnings or + messages produced. 
+- Use `fig.height` and `fig.width` to control the size of the figures + produced (in inches). + +So you might write: + +
```{r load_libraries, echo=FALSE, message=FALSE}
+library("dplyr")
+library("ggplot2")
+```
+
+ +Often there will be particular options that you'll want to use +repeatedly; for this, you can set _global_ chunk options, like so: + +
```{r global_options, echo=FALSE}
+knitr::opts_chunk$set(fig.path="Figs/", message=FALSE, warning=FALSE,
+                      echo=FALSE, results="hide", fig.width=11)
+```
+
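One way to confirm that global options like those set above took effect is to read them back with `knitr::opts_chunk$get()`. The following is a minimal sketch, assuming the `knitr` package is installed; the variable names `current_path` and `current_width` are illustrative only.

```r
# Sketch: set a few global chunk options, then read them back.
# Assumes knitr is installed; variable names are illustrative.
library(knitr)

opts_chunk$set(fig.path = "Figs/", echo = FALSE, fig.width = 11)

# opts_chunk$get() returns the current value of a named option
current_path  <- opts_chunk$get("fig.path")
current_width <- opts_chunk$get("fig.width")

print(current_path)
print(current_width)
```

Reading options back this way can help when a figure ends up in an unexpected place or at an unexpected size.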
+ +The `fig.path` option defines where the figures will be saved. The `/` +here is really important; without it, the figures would be saved in +the standard place but just with names that begin with `Figs`. + +If you have multiple R Markdown files in a common directory, you might +want to use `fig.path` to define separate prefixes for the figure file +names, like `fig.path="Figs/cleaning-"` and `fig.path="Figs/analysis-"`. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 3 + +Use chunk options to control the size of a figure and to hide the +code. + +::::::::::::::: solution + +## Solution to Challenge 3 + +
```{r echo = FALSE, fig.width = 3}
+plot(faithful)
+```
+
+ +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +You can review all of the `R` chunk options by navigating to +the "R Markdown Cheat Sheet" under the "Cheatsheets" section +of the "Help" field in the toolbar at the top of RStudio. + +## Inline R code + +You can make _every_ number in your report reproducible. Use \`r and \` for an in-line code chunk, +like so: ` ``r "r round(some_value, 2)"`` `. The code will be +executed and replaced with the _value_ of the result. + +Don't let these in-line chunks get split across lines. + +Perhaps precede the paragraph with a larger code chunk that does +calculations and defines variables, with `include=FALSE` for that larger +chunk (which is the same as `echo=FALSE` and `results="hide"`). + +Rounding can produce differences in output in such situations. You may want +`2.0`, but `round(2.03, 1)` will give just `2`. + +The +[`myround`](https://github.com/kbroman/broman/blob/master/R/myround.R) +function in the [R/broman](https://github.com/kbroman/broman) package handles +this. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 4 + +Try out a bit of in-line R code. + +::::::::::::::: solution + +## Solution to Challenge 4 + +Here's some inline code to determine that 2 + 2 = `r 2+2`. + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Other output options + +You can also convert R Markdown to a PDF or a Word document. Click the +little triangle next to the "Knit" button to get a drop-down +menu. Or you could put `pdf_document` or `word_document` in the initial header +of the file. + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: Creating PDF documents + +Creating .pdf documents may require installation of some extra software. The R +package `tinytex` provides some tools to help make this process easier for R users. 
+With `tinytex` installed, run `tinytex::install_tinytex()` to install the required +software (you'll only need to do this once) and then when you knit to pdf `tinytex` +will automatically detect and install any additional LaTeX packages that are needed to +produce the pdf document. Visit the [tinytex website](https://yihui.org/tinytex/) +for more information. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: Visual markdown editing in RStudio + +RStudio versions 1.4 and later include visual markdown editing mode. +In visual editing mode, markdown expressions (like `**bold words**`) are +transformed to the formatted appearance (**bold words**) as you type. +This mode also includes a toolbar at the top with basic formatting buttons, +similar to what you might see in common word processing software programs. +You can turn visual editing on and off by pressing +the ![](fig/visual_mode_icon.png){alt='Icon for turning on and off the visual editing mode in RStudio, which looks like a pair of compasses'} +button in the top right corner of your R Markdown document. 
+ +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Resources + +- [Knitr in a knutshell tutorial](https://kbroman.org/knitr_knutshell) +- [Dynamic Documents with R and knitr](https://www.amazon.com/exec/obidos/ASIN/1482203537/7210-20) (book) +- [R Markdown documentation](https://rmarkdown.rstudio.com) +- [R Markdown cheat sheet](https://www.rstudio.com/wp-content/uploads/2016/03/rmarkdown-cheatsheet-2.0.pdf) +- [Getting started with R Markdown](https://www.rstudio.com/resources/webinars/getting-started-with-r-markdown/) +- [R Markdown: The Definitive Guide](https://bookdown.org/yihui/rmarkdown/) (book by Rstudio team) +- [Reproducible Reporting](https://www.rstudio.com/resources/webinars/reproducible-reporting/) +- [The Ecosystem of R Markdown](https://www.rstudio.com/resources/webinars/the-ecosystem-of-r-markdown/) +- [Introducing Bookdown](https://www.rstudio.com/resources/webinars/introducing-bookdown/) + +:::::::::::::::::::::::::::::::::::::::: keypoints + +- Mix reporting written in R Markdown with software written in R. +- Specify chunk options to control formatting. +- Use `knitr` to convert these documents into PDF and other formats. + +:::::::::::::::::::::::::::::::::::::::::::::::::: diff --git a/locale/es/episodes/15-wrap-up.Rmd b/locale/es/episodes/15-wrap-up.Rmd new file mode 100644 index 000000000..d9fa5b74f --- /dev/null +++ b/locale/es/episodes/15-wrap-up.Rmd @@ -0,0 +1,110 @@ +--- +title: Writing Good Software +teaching: 15 +exercises: 0 +source: Rmd +--- + +::::::::::::::::::::::::::::::::::::::: objectives + +- Describe best practices for writing R and explain the justification for each. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- How can I write software that other people can use? 
+ +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Structure your project folder + +Keep your project folder structured, organized and tidy, by creating subfolders for your code files, manuals, data, binaries, output plots, etc. It can be done completely manually, or with the help of RStudio's `New Project` functionality, or a designated package, such as `ProjectTemplate`. + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: ProjectTemplate - a possible solution + +One way to automate the management of projects is to install the third-party package, `ProjectTemplate`. +This package will set up an ideal directory structure for project management. +This is very useful as it enables you to have your analysis pipeline/workflow organised and structured. +Together with the default RStudio project functionality and Git you will be able to keep track of your +work as well as be able to share your work with collaborators. + +1. Install `ProjectTemplate`. +2. Load the library +3. Initialise the project: + +```{r, eval=FALSE} +install.packages("ProjectTemplate") +library("ProjectTemplate") +create.project("../my_project_2", merge.strategy = "allow.non.conflict") +``` + +For more information on ProjectTemplate and its functionality visit the +home page [ProjectTemplate](https://projecttemplate.net/index.html) + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Make code readable + +The most important part of writing code is making it readable and understandable. +You want someone else to be able to pick up your code and be able to understand +what it does: more often than not this someone will be you 6 months down the line, +who will otherwise be cursing past-self. + +## Documentation: tell us what and why, not how + +When you first start out, your comments will often describe what a command does, +since you're still learning yourself and it can help to clarify concepts and +remind you later. 
However, these comments aren't particularly useful later on +when you don't remember what problem your code is trying to solve. Try to also +include comments that tell you _why_ you're solving a problem, and _what_ problem +that is. The _how_ can come after that: it's an implementation detail you ideally +shouldn't have to worry about. + +## Keep your code modular + +Our recommendation is that you should separate your functions from your analysis +scripts, and store them in a separate file that you `source` when you open the R +session in your project. This approach is nice because it leaves you with an +uncluttered analysis script, and a repository of useful functions that can be +loaded into any analysis script in your project. It also lets you group related +functions together easily. + +## Break down problem into bite size pieces + +When you first start out, problem solving and function writing can be daunting +tasks, and hard to separate from code inexperience. Try to break down your +problem into digestible chunks and worry about the implementation details later: +keep breaking down the problem into smaller and smaller functions until you +reach a point where you can code a solution, and build back up from there. + +## Know that your code is doing the right thing + +Make sure to test your functions! + +## Don't repeat yourself + +Functions enable easy reuse within a project. If you see blocks of similar +lines of code through your project, those are usually candidates for being +moved into functions. + +If your calculations are performed through a series of functions, then the +project becomes more modular and easier to change. This is especially the case +for which a particular input always gives a particular output. + +## Remember to be stylish + +Apply consistent style to your code. + +:::::::::::::::::::::::::::::::::::::::: keypoints + +- Keep your project folder structured, organized and tidy. +- Document what and why, not how. 
+- Break programs into short single-purpose functions. +- Write re-runnable tests. +- Don't repeat yourself. +- Be consistent in naming, indentation, and other aspects of style. + +:::::::::::::::::::::::::::::::::::::::::::::::::: diff --git a/locale/es/index.md b/locale/es/index.md new file mode 100644 index 000000000..c434e5efa --- /dev/null +++ b/locale/es/index.md @@ -0,0 +1,34 @@ +--- +site: sandpaper::sandpaper_site +--- + +_an introduction to R for non-programmers using gapminder data_ + +The goal of this lesson is to teach novice programmers to write modular code +and best practices for using R for data analysis. R is commonly used in many +scientific disciplines for statistical analysis and its array of third-party +packages. We find that many scientists who come to Software Carpentry workshops +use R and want to learn more. The emphasis of these materials is to give +attendees a strong foundation in the fundamentals of R, and to teach best +practices for scientific computing: breaking down analyses into modular units, +task automation, and encapsulation. + +Note that this workshop will focus on teaching the fundamentals of the +programming language R, and will not teach statistical analysis. + +The lesson contains more material than can be taught in a day. The [instructor notes page](instructors/instructor-notes.md) has some suggested lesson plans suitable for a one or half day workshop. + +A variety of third party packages are used throughout this workshop. These +are not necessarily the best, nor are they comprehensive, but they are +packages we find useful, and have been chosen primarily for their +usability. + +:::::::::::::::::::::::::::::::::::::::::: prereq + +## Prerequisites + +Understand that computers store data and instructions (programs, scripts etc.) in files. +Files are organised in directories (folders). +Know how to access files not in the working directory by specifying the path. 
+ +:::::::::::::::::::::::::::::::::::::::::::::::::: diff --git a/locale/es/instructors/instructor-notes.md b/locale/es/instructors/instructor-notes.md new file mode 100644 index 000000000..43ffc4c20 --- /dev/null +++ b/locale/es/instructors/instructor-notes.md @@ -0,0 +1,132 @@ +--- +title: Instructor Notes +--- + +## Timing + +Leave about 30 minutes at the start of each workshop and another 15 mins +at the start of each session for technical difficulties like WiFi and +installing things (even if you asked students to install in advance, longer if +not). + +## Lesson Plans + +The lesson contains much more material than can be taught in a day. +Instructors will need to pick an appropriate subset of episodes to use +in a standard one day course. + +Some suggested paths through the material are: + +(suggested by [@liz-is](https://github.com/swcarpentry/r-novice-gapminder/issues/104#issuecomment-276529213)) + +- 01 Introduction to R and RStudio +- 04 Data Structures +- 05 Exploring Data Frames ("Realistic example" section onwards) +- 08 Creating Publication-Quality Graphics with ggplot2 +- 10 Functions Explained +- 13 Dataframe Manipulation with dplyr +- 15 Producing Reports With knitr + +(suggested by [@naupaka](https://github.com/swcarpentry/r-novice-gapminder/issues/104#issuecomment-312547509)) + +- 01 Introduction to R and RStudio +- 02 Project Management With RStudio +- 03 Seeking Help +- 04 Data Structures +- 05 Exploring Data Frames +- 06 Subsetting Data +- 09 Vectorization +- 08 Creating Publication-Quality Graphics with ggplot2 _OR_ + 13 Dataframe Manipulation with dplyr +- 15 Producing Reports With knitr + +A half day course could consist of (suggested by [@karawoo](https://github.com/swcarpentry/r-novice-gapminder/issues/104#issuecomment-277599864)): + +- 01 Introduction to R and RStudio +- 04 Data Structures (only creating vectors with `c()`) +- 05 Exploring Data Frames ("Realistic example" section onwards) +- 06 Subsetting Data (excluding factor, matrix 
and list subsetting) +- 08 Creating Publication-Quality Graphics with ggplot2 + +## Setting up git in RStudio + +There can be difficulties linking git to RStudio depending on the +operating system and the version of the operating system. To make sure +Git is properly installed and configured, the learners should go to +the Options window in the RStudio application. + +- **Mac OS X:** + - Go RStudio -> Preferences... -> Git/SVN + - Check and see whether there is a path to a file in the "Git executable" window. If not, the next challenge is figuring out where Git is located. + - In the terminal enter `which git` and you will get a path to the git executable. In the "Git executable" window you may have difficulties finding the directory since OS X hides many of the operating system files. While the file selection window is open, pressing "Command-Shift-G" will pop up a text entry box where you will be able to type or paste in the full path to your git executable: e.g. /usr/bin/git or whatever else it might be. +- **Windows:** + - Go Tools -> Global options... -> Git/SVN + - If you use the Software Carpentry Installer, then 'git.exe' should be installed at `C:/Program Files/Git/bin/git.exe`. + +To prevent the learners from having to re-enter their password each time they push a commit to GitHub, this command (which can be run from a bash prompt) will make it so they only have to enter their password once: + +```bash +$ git config --global credential.helper 'cache --timeout=10000000' +``` + +## RStudio Color Preview + +RStudio has a feature to preview the color for certain named colors and hexadecimal colors. This may confuse or distract learners (and instructors) who are not expecting it. 
+ +Mainly, this is likely to come up during the episode on "Data Structures" with the following code block: + +```r +cats <- data.frame(coat = c("calico", "black", "tabby"), + weight = c(2.1, 5.0, 3.2), + likes_string = c(1, 0, 1)) +``` + +This option can be turned off and on in the following menu setting: +Tools -> Global Options -> Code -> Display -> Enable preview of named and hexadecimal colors (under "Syntax") + +## Pulling in Data + +The easiest way to get the data used in this lesson during a workshop is to have +attendees download the raw data from [gapminder-data] and +[gapminder-data-wide]. + +Attendees can use the `File - Save As` dialog in their browser to save the file. + +## Overall + +Make sure to emphasize good practices: put code in scripts, and make +sure they're version controlled. Encourage students to create script +files for challenges. + +If you're working in a cloud environment, get them to upload the +gapminder data after the second lesson. + +Make sure to emphasize that matrices are vectors underneath the hood +and data frames are lists underneath the hood: this will explain a +lot of the esoteric behaviour encountered in basic operations. + +Vector recycling and function stacks are probably best explained +with diagrams on a whiteboard. + +Be sure to actually go through examples of an R help page: help files +can be intimidating at first, but knowing how to read them is tremendously +useful. + +Be sure to show the CRAN task views, look at one of the topics. + +There's a lot of content: move quickly through the earlier lessons. Their +extensiveness is mostly for purposes of learning by osmosis: so that their +memory will trigger later when they encounter a problem or some esoteric behaviour. + +Key lessons to take time on: + +- Data subsetting - conceptually difficult for novices +- Functions - learners especially struggle with this +- Data structures - worth being thorough, but you can go through it quickly. 
+ +Don't worry about being correct or knowing the material back-to-front. Use +mistakes as teaching moments: the most vital skill you can impart is how to +debug and recover from unexpected errors. + +[gapminder-data]: data/gapminder_data.csv +[gapminder-data-wide]: data/gapminder_wide.csv diff --git a/locale/es/learners/discuss.md b/locale/es/learners/discuss.md new file mode 100644 index 000000000..0605730b1 --- /dev/null +++ b/locale/es/learners/discuss.md @@ -0,0 +1,7 @@ +--- +title: Discussion +--- + +Please see [our other R lesson][r-gap] for a different presentation of these concepts. + +[r-gap]: https://swcarpentry.github.io/r-novice-gapminder/ diff --git a/locale/es/learners/reference.md b/locale/es/learners/reference.md new file mode 100644 index 000000000..a4c31f8db --- /dev/null +++ b/locale/es/learners/reference.md @@ -0,0 +1,342 @@ +--- +title: Reference +--- + +## Reference + +## [Introduction to R and RStudio](episodes/01-rstudio-intro.Rmd) + +- Use the escape key to cancel incomplete commands or running code + (or Ctrl+C if you're using R from the shell). +- Basic arithmetic operations follow standard order of precedence: + - Brackets: `(`, `)` + - Exponents: `^` or `**` + - Divide: `/` + - Multiply: `*` + - Add: `+` + - Subtract: `-` +- Scientific notation is available, e.g. `2e-3` +- Anything to the right of a `#` is a comment; R will ignore it. +- Functions are denoted by `function_name()`. Expressions inside the + brackets are evaluated before being passed to the function, and + functions can be nested. +- Mathematical functions: `exp`, `sin`, `log`, `log10`, `log2` etc. +- Comparison operators: `<`, `<=`, `>`, `>=`, `==`, `!=` +- Use `all.equal` to compare numbers! +- `<-` is the assignment operator. Anything to the right is evaluated, then + stored in a variable named to the left. +- `ls` lists all variables and functions you've created +- `rm` can be used to remove them +- When assigning values to function arguments, you _must_ use `=`.
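
The advice above to use `all.equal` when comparing numbers can be illustrated with a short sketch. The variable names are illustrative; the point is that floating-point arithmetic is inexact, so `==` may report a difference that `all.equal` treats as negligible.

```r
# Sketch: comparing floating-point numbers.
x <- 0.1 + 0.2

# == tests for exact equality, which fails because of rounding error
exact_match <- x == 0.3

# all.equal() compares within a small tolerance; wrap it in isTRUE()
# because it returns a description of the difference when values differ
close_enough <- isTRUE(all.equal(x, 0.3))

print(exact_match)   # FALSE
print(close_enough)  # TRUE
```

Wrapping the call in `isTRUE()` keeps the result safe to use inside an `if` condition.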
+ +## [Project management with RStudio](episodes/02-project-intro.Rmd) + +- To create a new project, go to File -> New Project +- Install the `packrat` package to create self-contained projects +- `install.packages` to install packages from CRAN +- `library` to load a package into R +- `packrat::status` to check whether all packages referenced in your + scripts have been installed. + +## [Seeking help](episodes/03-seeking-help.Rmd) + +- To access help for a function type `?function_name` or `help(function_name)` +- Use quotes for special operators e.g. `?"+"` +- Use fuzzy search if you can't remember a name: `??search_term` +- [CRAN task views](https://cran.at.r-project.org/web/views) are a good starting point. +- [Stack Overflow](https://stackoverflow.com/) is a good place to get help with your code. + - `?dput` will dump data you are working from so others can load it easily. + - `sessionInfo()` will give details of your setup that others may need for debugging. + +## [Data structures](episodes/04-data-structures-part1.Rmd) + +Individual values in R must be one of 5 **data types**; multiple values can be grouped in **data structures**. + +**Data types** + +- `typeof(object)` gives information about an item's data type. + +- There are 5 main data types: + + - `?numeric` real (decimal) numbers + - `?integer` whole numbers only + - `?character` text + - `?complex` complex numbers + - `?logical` TRUE or FALSE values + + **Special types:** + + - `?NA` missing values + - `?NaN` "not a number" for undefined values (e.g. `0/0`). + - `?Inf`, `-Inf` infinity. + - `?NULL` a data structure that doesn't exist + + `NA` can occur in any atomic vector. `NaN` and `Inf` can only + occur in complex, integer or numeric type vectors. Atomic vectors + are the building blocks for all other data structures. A `NULL` value + will occur in place of an entire data structure (but can occur as list + elements).
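
The data types and special values listed above can be checked directly at the console. This is a minimal sketch with illustrative variable names; note that `typeof()` reports `"double"` for ordinary numeric values.

```r
# Sketch: inspect the five main data types with typeof().
types <- c(
  typeof(3.14),    # numeric values are stored as "double"
  typeof(2L),      # the L suffix makes an integer
  typeof("cat"),   # character
  typeof(1 + 4i),  # complex
  typeof(TRUE)     # logical
)
print(types)

# Special values and the functions that detect them
not_a_number <- 0 / 0
infinite     <- 1 / 0

print(is.nan(not_a_number))   # TRUE
print(is.infinite(infinite))  # TRUE
print(is.na(NA))              # TRUE
print(is.null(NULL))          # TRUE
print(length(NULL))           # 0
```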
+ +**Basic data structures in R:** + +- atomic `?vector` (can only contain one type) +- `?list` (containers for other objects) +- `?data.frame` two dimensional objects whose columns can contain different types of data +- `?matrix` two dimensional objects that can contain only one type of data. +- `?factor` vectors that contain predefined categorical data. +- `?array` multi-dimensional objects that can only contain one type of data + +Remember that matrices are really atomic vectors underneath the hood, and that +data.frames are really lists underneath the hood (this explains some of the weirder +behaviour of R). + +**[Vectors](episodes/04-data-structures-part1.Rmd)** + +- `?vector()` All items in a vector must be the same type. +- Items can be converted from one type to another using _coercion_. +- The concatenate function 'c()' will append items to a vector. +- `seq(from=0, to=1, by=1)` will create a sequence of numbers. +- Items in a vector can be named using the `names()` function. + +**[Factors](episodes/04-data-structures-part1.Rmd)** + +- `?factor()` Factors are a data structure designed to store categorical data. +- `levels()` shows the valid values that can be stored in a vector of type factor. + +**[Lists](episodes/04-data-structures-part1.Rmd)** + +- `?list()` Lists are a data structure designed to store data of different types. + +**[Matrices](episodes/04-data-structures-part1.Rmd)** + +- `?matrix()` Matrices are a data structure designed to store 2-dimensional data. + +**[Data Frames](episodes/05-data-structures-part2.Rmd)** + +- `?data.frame` is a key data structure. It is a `list` of `vectors`. +- `cbind()` will add a column (vector) to a data.frame. +- `rbind()` will add a row (list) to a data.frame. + +**Useful functions for querying data structures:** + +- `?str` structure, prints out a summary of the whole data structure +- `?typeof` tells you the type inside an atomic vector +- `?class` what is the data structure? 
+- `?head` print the first `n` elements (rows for two-dimensional objects) +- `?tail` print the last `n` elements (rows for two-dimensional objects) +- `?rownames`, `?colnames`, `?dimnames` retrieve or modify the row names + and column names of an object. +- `?names` retrieve or modify the names of an atomic vector or list (or + columns of a data.frame). +- `?length` get the number of elements in an atomic vector +- `?nrow`, `?ncol`, `?dim` get the dimensions of an n-dimensional object + (Won't work on atomic vectors or lists). + +## [Exploring Data Frames](episodes/05-data-structures-part2.Rmd) + +- `read.csv` to read in data in a regular structure + - `sep` argument to specify the separator + - "," for comma separated + - "\\t" for tab separated + - Other arguments: + - `header=TRUE` if there is a header row + +## [Subsetting data](episodes/06-data-subsetting.Rmd) + +- Elements can be accessed by: + + - Index + - Name + - Logical vectors + +- `[` single square brackets: + + - _extract_ single elements or _subset_ vectors + - e.g. `x[1]` extracts the first item from vector x. + - _extract_ single elements of a list. The returned value will be another `list()`. + - _extract_ columns from a data.frame + +- `[` with two arguments to: + + - _extract_ rows and/or columns of + - matrices + - data.frames + - e.g. `x[1,2]` will extract the value in row 1, column 2. + - e.g. `x[, 2]` will extract the entire second column of values. + +- `[[` double square brackets to extract items from lists. + +- `$` to access columns or list elements by name + +- negative indices skip elements + +## [Control flow](episodes/07-control-flow.Rmd) + +- Use `if` condition to start a conditional statement, `else if` condition to provide + additional tests, and `else` to provide a default +- The bodies of the branches of conditional statements must be enclosed in curly braces `{}` when they span more than one statement; indentation is optional but aids readability. +- Use `==` to test for equality. +- `%in%` will return a `TRUE`/`FALSE` indicating if there is a match between an element and a vector.
+- `X && Y` is only true if both X and Y are `TRUE`. +- `X || Y` is true if either X or Y, or both, are `TRUE`. +- Zero is considered `FALSE`; all other numbers are considered `TRUE` +- Nest loops to operate on multi-dimensional data. + +## [Creating publication quality graphics](episodes/08-plot-ggplot2.Rmd) + +- figures can be created with the grammar of graphics: + - `library(ggplot2)` + - `ggplot` to create the base figure + - `aes`thetics specify the data axes, shape, color, and data size + - `geom`etry functions specify the type of plot, e.g. `point`, `line`, `density`, `box` + - `geom`etry functions also add statistical transforms, e.g. `geom_smooth` + - `scale` functions change the mapping from data to aesthetics + - `facet` functions stratify the figure into panels + - `aes`thetics apply to individual layers, or can be set for the whole plot + inside `ggplot`. + - `theme` functions change the overall look of the plot + - order of layers matters! + - `ggsave` to save a figure. + +## [Vectorization](episodes/09-vectorization.Rmd) + +- Most functions and operations apply to each element of a vector +- `*` applies element-wise to matrices +- `%*%` for true matrix multiplication +- `any()` will return `TRUE` if any element of a vector is `TRUE` +- `all()` will return `TRUE` if _all_ elements of a vector are `TRUE` + +## [Functions explained](episodes/10-functions.Rmd) + +- `?"function"` +- Put code whose parameters change frequently in a function, then call it with + different parameter values to customize its behavior. +- The last line of a function is returned, or you can use `return` explicitly +- Any code written in the body of the function will preferably look for variables defined inside the function. 
+- Document Why, then What, then lastly How (if the code isn't self-explanatory) + +## [Writing data](episodes/11-writing-data.Rmd) + +- `write.table` to write out objects in regular format +- set `quote=FALSE` so that text isn't wrapped in `"` marks + +## [Dataframe manipulation with dplyr](episodes/12-dplyr.Rmd) + +- `library(dplyr)` +- `?select` to extract variables by name. +- `?filter` return rows with matching conditions. +- `?group_by` group data by one or more variables. +- `?summarize` summarize multiple values to a single value. +- `?mutate` add new variables to a data.frame. +- Combine operations using the `?"%>%"` pipe operator. + +## [Dataframe manipulation with tidyr](episodes/13-tidyr.Rmd) + +- `library(tidyr)` +- `?pivot_longer` convert data from _wide_ to _long_ format. +- `?pivot_wider` convert data from _long_ to _wide_ format. +- `?separate` split a single value into multiple values. +- `?unite` merge multiple values into a single value. + +## [Producing reports with knitr](episodes/14-knitr-markdown.Rmd) + +- Value of reproducible reports +- Basics of Markdown +- R code chunks +- Chunk options +- Inline R code +- Other output formats + +## [Best practices for writing good code](episodes/15-wrap-up.Rmd) + +- Program defensively, i.e., assume that errors are going to arise, and write code to detect them when they do. +- Write tests before writing code in order to help determine exactly what that code is supposed to do. +- Know what code is supposed to do before trying to debug it. +- Make it fail every time. +- Make it fail fast. +- Change one thing at a time, and for a reason. +- Keep track of what you've done. +- Be humble + +## Glossary + +[argument]{#argument} +: A value given to a function or program when it runs. +The term is often used interchangeably (and inconsistently) with [parameter](#parameter). + +[assign]{#assign} +: To give a value a name by associating a variable with it.
+ +[body]{#body} +: (of a function): the statements that are executed when a function runs. + +[comment]{#comment} +: A remark in a program that is intended to help human readers understand what is going on, +but is ignored by the computer. +Comments in Python, R, and the Unix shell start with a `#` character and run to the end of the line; +comments in SQL start with `--`, +and other languages have other conventions. + +[comma-separated values]{#comma-separated-values} +: (CSV) A common textual representation for tables +in which the values in each row are separated by commas. + +[delimiter]{#delimiter} +: A character or characters used to separate individual values, +such as the commas between columns in a [CSV](#comma-separated-values) file. + +[documentation]{#documentation} +: Human-language text written to explain what software does, +how it works, or how to use it. + +[floating-point number]{#floating-point-number} +: A number containing a fractional part and an exponent. +See also: [integer](#integer). + +[for loop]{#for-loop} +: A loop that is executed once for each value in some kind of set, list, or range. +See also: [while loop](#while-loop). + +[index]{#index} +: A subscript that specifies the location of a single value in a collection, +such as a single pixel in an image. + +[integer]{#integer} +: A whole number, such as -12343. See also: [floating-point number](#floating-point-number). + +[library]{#library} +: In R, the directory(ies) where [packages](#package) are stored. + +[package]{#package} +: A collection of R functions, data and compiled code in a well-defined format. Packages are stored in a [library](#library) and loaded using the library() function. + +[parameter]{#parameter} +: A variable named in the function's declaration that is used to hold a value passed into the call. +The term is often used interchangeably (and inconsistently) with [argument](#argument). 
+ +[return statement]{#return-statement} +: A statement that causes a function to stop executing and return a value to its caller immediately. + +[sequence]{#sequence} +: A collection of information that is presented in a specific order. + +[shape]{#shape} +: An array's dimensions, represented as a vector. +For example, a 5×3 array's shape is `(5,3)`. + +[string]{#string} +: Short for "character string", +a [sequence](#sequence) of zero or more characters. + +[syntax error]{#syntax-error} +: A programming error that occurs when statements are in an order or contain characters +not expected by the programming language. + +[type]{#type} +: The classification of something in a program (for example, the contents of a variable) +as a kind of number (e.g. [floating-point number](#floating-point-number), [integer](#integer)), [string](#string), +or something else. In R the command typeof() is used to query a variables type. + +[while loop]{#while-loop} +: A loop that keeps executing as long as some condition is true. +See also: [for loop](#for-loop). diff --git a/locale/es/learners/setup.md b/locale/es/learners/setup.md new file mode 100644 index 000000000..736e10764 --- /dev/null +++ b/locale/es/learners/setup.md @@ -0,0 +1,8 @@ +--- +title: Setup +--- + +This lesson assumes you have R and RStudio installed on your computer. + +- [Download and install the latest version of R](https://www.r-project.org/). +- [Download and install RStudio](https://www.rstudio.com/products/rstudio/download/#download). RStudio is an application (an integrated development environment or IDE) that facilitates the use of R and offers a number of nice additional features. You will need the free Desktop version for your computer. diff --git a/locale/es/profiles/learner-profiles.md b/locale/es/profiles/learner-profiles.md new file mode 100644 index 000000000..75b2c5cad --- /dev/null +++ b/locale/es/profiles/learner-profiles.md @@ -0,0 +1,5 @@ +--- +title: FIXME +--- + +This is a placeholder file. 
Please add content here. diff --git a/locale/it/CODE_OF_CONDUCT.md b/locale/it/CODE_OF_CONDUCT.md new file mode 100644 index 000000000..a820b8df5 --- /dev/null +++ b/locale/it/CODE_OF_CONDUCT.md @@ -0,0 +1,12 @@ +--- +title: Contributor Code of Conduct +--- + +As contributors and maintainers of this project, +we pledge to follow the [The Carpentries Code of Conduct][coc]. + +Instances of abusive, harassing, or otherwise unacceptable behavior +may be reported by following our [reporting guidelines][coc-reporting]. + +[coc-reporting]: https://docs.carpentries.org/topic_folders/policies/incident-reporting.html +[coc]: https://docs.carpentries.org/topic_folders/policies/code-of-conduct.html diff --git a/locale/it/CONTRIBUTING.md b/locale/it/CONTRIBUTING.md new file mode 100644 index 000000000..d29e890c5 --- /dev/null +++ b/locale/it/CONTRIBUTING.md @@ -0,0 +1,122 @@ +## Contributing + +[The Carpentries][cp-site] ([Software Carpentry][swc-site], [Data +Carpentry][dc-site], and [Library Carpentry][lc-site]) are open source +projects, and we welcome contributions of all kinds: new lessons, fixes to +existing material, bug reports, and reviews of proposed changes are all +welcome. + +### Contributor Agreement + +By contributing, you agree that we may redistribute your work under our +license. In exchange, we will address your issues and/or assess +your change proposal as promptly as we can, and help you become a member of our +community. Everyone involved in [The Carpentries][cp-site] agrees to abide by +our [code of conduct](CODE_OF_CONDUCT.md). + +### How to Contribute + +The easiest way to get started is to file an issue to tell us about a spelling +mistake, some awkward wording, or a factual error. This is a good way to +introduce yourself and to meet some of our community members. + +1. If you do not have a [GitHub][github] account, you can [send us comments by + email][contact]. 
However, we will be able to respond more quickly if you use + one of the other methods described below. + +2. If you have a [GitHub][github] account, or are willing to [create + one][github-join], but do not know how to use Git, you can report problems + or suggest improvements by [creating an issue][repo-issues]. This allows us + to assign the item to someone and to respond to it in a threaded discussion. + +3. If you are comfortable with Git, and would like to add or change material, + you can submit a pull request (PR). Instructions for doing this are + [included below](#using-github). For inspiration about changes that need to + be made, check out the [list of open issues][issues] across the Carpentries. + +Note: if you want to build the website locally, please refer to [The Workbench +documentation][template-doc]. + +### Where to Contribute + +1. If you wish to change this lesson, add issues and pull requests here. +2. If you wish to change the template used for workshop websites, please refer + to [The Workbench documentation][template-doc]. + +### What to Contribute + +There are many ways to contribute, from writing new exercises and improving +existing ones to updating or filling in the documentation and submitting [bug +reports][issues] about things that do not work, are not clear, or are missing. +If you are looking for ideas, please see [the list of issues for this +repository][repo-issues], or the issues for [Data Carpentry][dc-issues], +[Library Carpentry][lc-issues], and [Software Carpentry][swc-issues] projects. + +Comments on issues and reviews of pull requests are just as welcome: we are +smarter together than we are on our own. **Reviews from novices and newcomers +are particularly valuable**: it's easy for people who have been using these +lessons for a while to forget how impenetrable some of this material can be, so +fresh eyes are always welcome. 
+ +### What _Not_ to Contribute + +Our lessons already contain more material than we can cover in a typical +workshop, so we are usually _not_ looking for more concepts or tools to add to +them. As a rule, if you want to introduce a new idea, you must (a) estimate how +long it will take to teach and (b) explain what you would take out to make room +for it. The first encourages contributors to be honest about requirements; the +second, to think hard about priorities. + +We are also not looking for exercises or other material that only run on one +platform. Our workshops typically contain a mixture of Windows, macOS, and +Linux users; in order to be usable, our lessons must run equally well on all +three. + +### Using GitHub + +If you choose to contribute via GitHub, you may want to look at [How to +Contribute to an Open Source Project on GitHub][how-contribute]. In brief, we +use [GitHub flow][github-flow] to manage changes: + +1. Create a new branch in your desktop copy of this repository for each + significant change. +2. Commit the change in that branch. +3. Push that branch to your fork of this repository on GitHub. +4. Submit a pull request from that branch to the [upstream repository][repo]. +5. If you receive feedback, make changes on your desktop and push to your + branch on GitHub: the pull request will update automatically. + +NB: The published copy of the lesson is usually in the `main` branch. + +Each lesson has a team of maintainers who review issues and pull requests or +encourage others to do so. The maintainers are community volunteers, and have +final say over what gets merged into the lesson. + +### Other Resources + +The Carpentries is a global organisation with volunteers and learners all over +the world. We share values of inclusivity and a passion for sharing knowledge, +teaching and learning. There are several ways to connect with The Carpentries +community listed at \ including via social +media, slack, newsletters, and email lists. 
You can also [reach us by +email][contact]. + +[repo]: https://github.com/swcarpentry/r-novice-gapminder +[repo-issues]: https://github.com/swcarpentry/r-novice-gapminder/issues +[contact]: mailto:team@carpentries.org +[cp-site]: https://carpentries.org/ +[dc-issues]: https://github.com/issues?q=user%3Adatacarpentry +[dc-lessons]: https://datacarpentry.org/lessons/ +[dc-site]: https://datacarpentry.org/ +[discuss-list]: https://lists.software-carpentry.org/listinfo/discuss +[github]: https://github.com +[github-flow]: https://guides.github.com/introduction/flow/ +[github-join]: https://github.com/join +[how-contribute]: https://egghead.io/courses/how-to-contribute-to-an-open-source-project-on-github +[issues]: https://carpentries.org/help-wanted-issues/ +[lc-issues]: https://github.com/issues?q=user%3ALibraryCarpentry +[swc-issues]: https://github.com/issues?q=user%3Aswcarpentry +[swc-lessons]: https://software-carpentry.org/lessons/ +[swc-site]: https://software-carpentry.org/ +[lc-site]: https://librarycarpentry.org/ +[template-doc]: https://carpentries.github.io/workbench/ diff --git a/locale/it/LICENSE.md b/locale/it/LICENSE.md new file mode 100644 index 000000000..513ad8f83 --- /dev/null +++ b/locale/it/LICENSE.md @@ -0,0 +1,79 @@ +--- +title: Licenses +--- + +## Instructional Material + +All Carpentries (Software Carpentry, Data Carpentry, and Library Carpentry) +instructional material is made available under the [Creative Commons +Attribution license][cc-by-human]. The following is a human-readable summary of +(and not a substitute for) the [full legal text of the CC BY 4.0 +license][cc-by-legal]. + +You are free: + +- to **Share**---copy and redistribute the material in any medium or format +- to **Adapt**---remix, transform, and build upon the material + +for any purpose, even commercially. + +The licensor cannot revoke these freedoms as long as you follow the license +terms. 
+ +Under the following terms: + +- **Attribution**---You must give appropriate credit (mentioning that your work + is derived from work that is Copyright (c) The Carpentries and, where + practical, linking to \), provide a [link to the + license][cc-by-human], and indicate if changes were made. You may do so in + any reasonable manner, but not in any way that suggests the licensor endorses + you or your use. + +- **No additional restrictions**---You may not apply legal terms or + technological measures that legally restrict others from doing anything the + license permits. With the understanding that: + +Notices: + +- You do not have to comply with the license for elements of the material in + the public domain or where your use is permitted by an applicable exception + or limitation. +- No warranties are given. The license may not give you all of the permissions + necessary for your intended use. For example, other rights such as publicity, + privacy, or moral rights may limit how you use the material. + +## Software + +Except where otherwise noted, the example programs and other software provided +by The Carpentries are made available under the [OSI][osi]-approved [MIT +license][mit-license]. + +Permission is hereby granted, free of charge, to any person obtaining a copy of +this software and associated documentation files (the "Software"), to deal in +the Software without restriction, including without limitation the rights to +use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies +of the Software, and to permit persons to whom the Software is furnished to do +so, subject to the following conditions: + +The above copyright notice and this permission notice shall be included in all +copies or substantial portions of the Software. + +THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE +AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +SOFTWARE. + +## Trademark + +"The Carpentries", "Software Carpentry", "Data Carpentry", and "Library +Carpentry" and their respective logos are registered trademarks of [Community +Initiatives][ci]. + +[cc-by-human]: https://creativecommons.org/licenses/by/4.0/ +[cc-by-legal]: https://creativecommons.org/licenses/by/4.0/legalcode +[mit-license]: https://opensource.org/licenses/mit-license.html +[ci]: https://communityin.org/ +[osi]: https://opensource.org diff --git a/locale/it/README.md b/locale/it/README.md new file mode 100644 index 000000000..19341cede --- /dev/null +++ b/locale/it/README.md @@ -0,0 +1,6 @@ +# Internationalisation hub repository for Software Carpentry R for Reproducible Scientific Analysis + +An introduction to R for non-programmers using the [Gapminder][gapminder] data. +Please see [https://swcarpentry.github.io/r-novice-gapminder](https://swcarpentry.github.io/r-novice-gapminder) for a rendered version of this material in English. + +More info to follow. diff --git a/locale/it/config.yaml b/locale/it/config.yaml new file mode 100644 index 000000000..e8310df81 --- /dev/null +++ b/locale/it/config.yaml @@ -0,0 +1,71 @@ +#------------------------------------------------------------ +#Values for this lesson. +#------------------------------------------------------------ +#Which carpentry is this (swc, dc, lc, or cp)? +#swc: Software Carpentry +#dc: Data Carpentry +#lc: Library Carpentry +#cp: Carpentries (to use for instructor training for instance) +#incubator: The Carpentries Incubator +carpentry: 'swc' +#Overall title for pages. 
+title: 'R for Reproducible Scientific Analysis' +#Date the lesson was created (YYYY-MM-DD, this is empty by default) +created: '2015-04-18' +#Comma-separated list of keywords for the lesson +keywords: 'software, data, lesson, The Carpentries' +#Life cycle stage of the lesson +#possible values: pre-alpha, alpha, beta, stable +life_cycle: 'stable' +#License of the lesson materials (recommended CC-BY 4.0) +license: 'CC-BY 4.0' +#Link to the source repository for this lesson +source: 'https://github.com/swcarpentry/r-novice-gapminder' +#Default branch of your lesson +branch: 'main' +#Who to contact if there are any issues +contact: 'team@carpentries.org' +#Navigation ------------------------------------------------ +#Use the following menu items to specify the order of +#individual pages in each dropdown section. Leave blank to +#include all pages in the folder. +#Example ------------- +#episodes: +#- introduction.md +#- first-steps.md +#learners: +#- setup.md +#instructors: +#- instructor-notes.md +#profiles: +#- one-learner.md +#- another-learner.md +#Order of episodes in your lesson +episodes: + - 01-rstudio-intro.Rmd + - 02-project-intro.Rmd + - 03-seeking-help.Rmd + - 04-data-structures-part1.Rmd + - 05-data-structures-part2.Rmd + - 06-data-subsetting.Rmd + - 07-control-flow.Rmd + - 08-plot-ggplot2.Rmd + - 09-vectorization.Rmd + - 10-functions.Rmd + - 11-writing-data.Rmd + - 12-dplyr.Rmd + - 13-tidyr.Rmd + - 14-knitr-markdown.Rmd + - 15-wrap-up.Rmd +#Information for Learners +learners: +#Information for Instructors +instructors: +#Learner Profiles +profiles: +#Customisation --------------------------------------------- +#This space below is where custom yaml items (e.g. 
pinning +#sandpaper and varnish versions) should live +url: 'https://swcarpentry.github.io/r-novice-gapminder' +analytics: carpentries +lang: en diff --git a/locale/it/episodes/01-rstudio-intro.Rmd b/locale/it/episodes/01-rstudio-intro.Rmd new file mode 100644 index 000000000..3949440a1 --- /dev/null +++ b/locale/it/episodes/01-rstudio-intro.Rmd @@ -0,0 +1,722 @@ +--- +title: Introduction to R and RStudio +teaching: 45 +exercises: 10 +source: Rmd +--- + +::::::::::::::::::::::::::::::::::::::: objectives + +- Describe the purpose and use of each pane in RStudio +- Locate buttons and options in RStudio +- Define a variable +- Assign data to a variable +- Manage a workspace in an interactive R session +- Use mathematical and comparison operators +- Call functions +- Manage packages + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- How to find your way around RStudio? +- How to interact with R? +- How to manage your environment? +- How to install packages? + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +```{r, include=FALSE} +``` + +## Before Starting The Workshop + +Please ensure you have the latest version of R and RStudio installed on your machine. This is important, as some packages used in the workshop may not install correctly (or at all) if R is not up to date. + +- [Download and install the latest version of R here](https://www.r-project.org/) +- [Download and install RStudio here](https://www.rstudio.com/products/rstudio/download/#download) + +## Why use R and R studio? + +Welcome to the R portion of the Software Carpentry workshop! + +Science is a multi-step process: once you've designed an experiment and collected +data, the real fun begins with analysis! Throughout this lesson, we're going to teach you some of the fundamentals of the R language as well as some best practices for organizing code for scientific projects that will make your life easier. 
+ +Although we could use a spreadsheet in Microsoft Excel or Google Sheets to analyze our data, these tools are limited in their flexibility and accessibility. Critically, they also make it difficult to share the steps used to explore and change the raw data, which is key to ["reproducible" research](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003285). + +Therefore, this lesson will teach you how to begin exploring your data using R and RStudio. The R program is available for Windows, Mac, and Linux operating systems, and is freely available from where you downloaded it above. To run R, all you need is the R program. + +However, to make using R easier, we will use the program RStudio, which we also downloaded above. RStudio is a free, open-source, Integrated Development +Environment (IDE). It provides a built-in editor, works on all platforms (including +on servers) and provides many advantages such as integration with version +control and project management. + +## Overview + +We will begin with raw data, perform exploratory analyses, and learn how to plot results graphically. This example starts with a dataset from [gapminder.org](https://www.gapminder.org) containing population information for many +countries through time. Can you read the data into R? Can you plot the population for +Senegal? Can you calculate the average income for countries on the continent of Asia? +By the end of these lessons you will be able to do things like plot the populations +for all of these countries in under a minute! + +**Basic layout** + +When you first open RStudio, you will be greeted by three panels: + +- The interactive R console/Terminal (entire left) +- Environment/History/Connections (tabbed in upper right) +- Files/Plots/Packages/Help/Viewer (tabbed in lower right) + +![](fig/01-rstudio.png){alt='RStudio layout'} + +Once you open files, such as R scripts, an editor panel will also open +in the top left.
+ +![](fig/01-rstudio-script.png){alt='RStudio layout with .R file open'} + +::::::::::::::::::::::::::::::::::::::::: callout + +## R scripts + +Any commands that you write in the R console can be saved to a file +to be run again. Files containing R code to be run in this way are +called R scripts. R scripts have `.R` at the end of their names to +let you know what they are. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Workflow within RStudio + +There are two main ways one can work within RStudio: + +1. Test and play within the interactive R console then copy code into + a .R file to run later. + +- This works well when doing small tests and initially starting off. +- It quickly becomes laborious. + +2. Start writing in a .R file and use RStudio's shortcut keys for the Run command + to push the current line, selected lines or modified lines to the + interactive R console. + +- This is a great way to start; all your code is saved for later +- You will be able to run the file you create from within RStudio + or using R's `source()` function. + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: Running segments of your code + +RStudio offers you great flexibility in running code from within the editor +window. There are buttons, menu choices, and keyboard shortcuts. To run the +current line, you can + +1. click on the `Run` button above the editor panel, or +2. select "Run Lines" from the "Code" menu, or +3. hit Ctrl\+Return in Windows or Linux + or \+Return on OS X. + (This shortcut can also be seen by hovering + the mouse over the button). To run a block of code, select it and then `Run`. + If you have modified a line of code within a block of code you have just run, + there is no need to reselect the section and `Run`; you can use the next button + along, `Re-run the previous region`. This will run the previous code block + including the modifications you have made.
+ +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Introduction to R + +Much of your time in R will be spent in the R interactive +console. This is where you will run all of your code, and can be a +useful environment to try out ideas before adding them to an R script +file. This console in RStudio is the same as the one you would get if +you typed in `R` in your command-line environment. + +The first thing you will see in the R interactive session is a bunch +of information, followed by a ">" and a blinking cursor. In many ways +this is similar to the shell environment you learned about during the +shell lessons: it operates on the same idea of a "Read, evaluate, +print loop": you type in commands, R tries to execute them, and then +returns a result. + +## Using R as a calculator + +The simplest thing you could do with R is to do arithmetic: + +```{r} +1 + 100 +``` + +And R will print out the answer, with a preceding "[1]". [1] is the index of +the first element of the line being printed in the console. For more information +on indexing vectors, see [Episode 6: Subsetting Data](https://swcarpentry.github.io/r-novice-gapminder/06-data-subsetting/index.html). + +If you type in an incomplete command, R will wait for you to +complete it. If you are familiar with Unix Shell's bash, you may recognize this behavior from bash. + +```r +> 1 + +``` + +```output ++ +``` + +Any time you hit return and the R session shows a "+" instead of a ">", it +means it's waiting for you to complete the command. If you want to cancel +a command you can hit Esc and RStudio will give you back the ">" prompt. + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: Canceling commands + +If you're using R from the command line instead of from within RStudio, +you need to use Ctrl\+C instead of Esc +to cancel the command. This applies to Mac users as well! 
+ +Canceling a command isn't only useful for killing incomplete commands: +you can also use it to tell R to stop running code (for example if it's +taking much longer than you expect), or to get rid of the code you're +currently writing. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +When using R as a calculator, the order of operations is the same as you +would have learned back in school. + +From highest to lowest precedence: + +- Parentheses: `(`, `)` +- Exponents: `^` or `**` +- Multiply: `*` +- Divide: `/` +- Add: `+` +- Subtract: `-` + +```{r} +3 + 5 * 2 +``` + +Use parentheses to group operations in order to force the order of +evaluation if it differs from the default, or to make clear what you +intend. + +```{r} +(3 + 5) * 2 +``` + +This can get unwieldy when not needed, but clarifies your intentions. +Remember that others may later read your code. + +```{r, eval=FALSE} +(3 + (5 * (2 ^ 2))) # hard to read +3 + 5 * 2 ^ 2 # clear, if you remember the rules +3 + 5 * (2 ^ 2) # if you forget some rules, this might help +``` + +The text after each line of code is called a +"comment". Anything that follows after the hash (or octothorpe) symbol +`#` is ignored by R when it executes code. + +Really small or large numbers get a scientific notation: + +```{r} +2/10000 +``` + +Which is shorthand for "multiplied by `10^XX`". So `2e-4` +is shorthand for `2 * 10^(-4)`. + +You can write numbers in scientific notation too: + +```{r} +5e3 # Note the lack of minus here +``` + +## Mathematical functions + +R has many built in mathematical functions. To call a function, +we can type its name, followed by open and closing parentheses. +Functions take arguments as inputs, anything we type inside the parentheses of a function is considered an argument. Depending on the function, the number of arguments can vary from none to multiple. 
For example: + +```{r, eval=FALSE} +getwd() #returns an absolute filepath +``` + +doesn't require an argument, whereas for the next set of mathematical functions we will need to supply the function a value in order to compute the result. + +```{r} +sin(1) # trigonometry functions +``` + +```{r} +log(1) # natural logarithm +``` + +```{r} +log10(10) # base-10 logarithm +``` + +```{r} +exp(0.5) # e^(1/2) +``` + +Don't worry about trying to remember every function in R. You +can look them up on Google, or if you can remember the +start of the function's name, use the tab completion in RStudio. + +This is one advantage that RStudio has over R on its own, it +has auto-completion abilities that allow you to more easily +look up functions, their arguments, and the values that they +take. + +Typing a `?` before the name of a command will open the help page +for that command. When using RStudio, this will open the 'Help' pane; +if using R in the terminal, the help page will open in your browser. +The help page will include a detailed description of the command and +how it works. Scrolling to the bottom of the help page will usually +show a collection of code examples which illustrate command usage. +We'll go through an example later. + +## Comparing things + +We can also do comparisons in R: + +```{r} +1 == 1 # equality (note two equals signs, read as "is equal to") +``` + +```{r} +1 != 2 # inequality (read as "is not equal to") +``` + +```{r} +1 < 2 # less than +``` + +```{r} +1 <= 1 # less than or equal to +``` + +```{r} +1 > 0 # greater than +``` + +```{r} +1 >= -9 # greater than or equal to +``` + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: Comparing Numbers + +A word of warning about comparing numbers: you should +never use `==` to compare two numbers unless they are +integers (a data type which can specifically represent +only whole numbers). 
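A small sketch of this warning, using only base R. The values `0.1`, `0.2`, and `0.3` and the variable names are illustrative assumptions, not part of the lesson data. It compares the same pair of numbers twice: once with `==`, and once with `all.equal()`, which checks equality within a small tolerance.

```r
# Illustrative values only: two ways of arriving at "0.3" as a decimal number.
a <- 0.1 + 0.2
b <- 0.3

# Exact comparison: FALSE, because the two stored values differ slightly.
exact_match <- a == b

# Tolerant comparison: all.equal() returns TRUE when the values are close
# enough, and a character description otherwise, so wrap it in isTRUE().
close_match <- isTRUE(all.equal(a, b))

print(exact_match)  # FALSE
print(close_match)  # TRUE
```

Wrapping the call in `isTRUE()` is a common precaution when the result is used as a condition, since `all.equal()` does not return `FALSE` on a mismatch.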
+ +Computers may only represent decimal numbers with a +certain degree of precision, so two numbers which look +the same when printed out by R, may actually have +different underlying representations and therefore be +different by a small margin of error (called Machine +numeric tolerance). + +Instead you should use the `all.equal` function. + +Further reading: [http://floating-point-gui.de/](https://floating-point-gui.de/) + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Variables and assignment + +We can store values in variables using the assignment operator `<-`, like this: + +```{r} +x <- 1/40 +``` + +Notice that assignment does not print a value. Instead, we stored it for later +in something called a **variable**. `x` now contains the **value** `0.025`: + +```{r} +x +``` + +More precisely, the stored value is a _decimal approximation_ of +this fraction called a [floating point number](https://en.wikipedia.org/wiki/Floating_point). + +Look for the `Environment` tab in the top right panel of RStudio, and you will see that `x` and its value +have appeared. Our variable `x` can be used in place of a number in any calculation that expects a number: + +```{r} +log(x) +``` + +Notice also that variables can be reassigned: + +```{r} +x <- 100 +``` + +`x` used to contain the value 0.025 and now it has the value 100. + +Assignment values can contain the variable being assigned to: + +```{r} +x <- x + 1 #notice how RStudio updates its description of x on the top right tab +y <- x * 2 +``` + +The right hand side of the assignment can be any valid R expression. +The right hand side is _fully evaluated_ before the assignment occurs. + +Variable names can contain letters, numbers, underscores and periods but no spaces. They +must start with a letter or a period followed by a letter (they cannot start with a number nor an underscore). +Variables beginning with a period are hidden variables. 
+Different people use different conventions for long variable names, these include + +- periods.between.words +- underscores\_between\_words +- camelCaseToSeparateWords + +What you use is up to you, but **be consistent**. + +It is also possible to use the `=` operator for assignment: + +```{r} +x = 1/40 +``` + +But this is much less common among R users. The most important thing is to +**be consistent** with the operator you use. There are occasionally places +where it is less confusing to use `<-` than `=`, and it is the most common +symbol used in the community. So the recommendation is to use `<-`. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 1 + +Which of the following are valid R variable names? + +```{r, eval=FALSE} +min_height +max.height +_age +.mass +MaxLength +min-length +2widths +celsius2kelvin +``` + +::::::::::::::: solution + +## Solution to challenge 1 + +The following can be used as R variables: + +```{r ch1pt1-sol, eval=FALSE} +min_height +max.height +MaxLength +celsius2kelvin +``` + +The following creates a hidden variable: + +```{r ch1pt2-sol, eval=FALSE} +.mass +``` + +The following will not be able to be used to create a variable + +```{r ch1pt3-sol, eval=FALSE} +_age +min-length +2widths +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Vectorization + +One final thing to be aware of is that R is _vectorized_, meaning that +variables and functions can have vectors as values. In contrast to physics and +mathematics, a vector in R describes a set of values in a certain order of the +same data type. For example: + +```{r} +1:5 +2^(1:5) +x <- 1:5 +2^x +``` + +This is incredibly powerful; we will discuss this further in an +upcoming lesson. + +## Managing your environment + +There are a few useful commands you can use to interact with the R session. 
+ +`ls` will list all of the variables and functions stored in the global environment +(your working R session): + +```{r, eval=FALSE} +ls() +``` + +```{r, echo=FALSE} +# If `ls()` is left to run by itself when rendering this Rmd document (as would +# happen if the code chunk above was evaluated), the output would contain extra +# items ("args", "dest_md", "op", "src_md") that people following the lesson +# would not see in their own session. +# +# This probably comes from the way the md episodes are generated when the +# lesson website is built. The solution below uses a temporary environment to +# mimick what the learners should observe when running `ls()` on their +# machines. + +temp.env <- new.env() +temp.env$x <- x +temp.env$y <- y +ls(temp.env) +rm(temp.env) +``` + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: hidden objects + +Like in the shell, `ls` will hide any variables or functions starting +with a "." by default. To list all objects, type `ls(all.names=TRUE)` +instead + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +Note here that we didn't give any arguments to `ls`, but we still +needed to give the parentheses to tell R to call the function. + +If we type `ls` by itself, R prints a bunch of code instead of a listing of objects. + +```{r} +ls +``` + +What's going on here? + +Like everything in R, `ls` is the name of an object, and entering the name of +an object by itself prints the contents of the object. The object `x` that we +created earlier contains `r x`: + +```{r} +x +``` + +The object `ls` contains the R code that makes the `ls` function work! We'll talk +more about how functions work and start writing our own later. 
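The difference between the object `ls` and the call `ls()` can be sketched as follows. This is a minimal illustration using base R only; the variable `x` and the name `names_found` are assumptions introduced for the example.

```r
# `ls` without parentheses refers to the function object itself.
print(class(ls))        # the kind of object `ls` is
print(is.function(ls))  # TRUE: `ls` is a function

# `ls()` with parentheses calls the function, which returns a character
# vector of the object names in the current environment.
x <- 100                # illustrative variable so there is something to list
names_found <- ls()
print(class(names_found))
print("x" %in% names_found)
```

Inspecting an object with `class()` before using it is a quick way to tell whether a name refers to data or to a function.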
+ +You can use `rm` to delete objects you no longer need: + +```{r, eval=FALSE} +rm(x) +``` + +If you have lots of things in your environment and want to delete all of them, +you can pass the results of `ls` to the `rm` function: + +```{r, eval=FALSE} +rm(list = ls()) +``` + +In this case we've combined the two. Like the order of operations, anything +inside the innermost parentheses is evaluated first, and so on. + +In this case we've specified that the results of `ls` should be used for the +`list` argument in `rm`. When assigning values to arguments by name, you _must_ +use the `=` operator!! + +If instead we use `<-`, there will be unintended side effects, or you may get an error message: + +```{r, error=TRUE} +rm(list <- ls()) +``` + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: Warnings vs. Errors + +Pay attention when R does something unexpected! Errors, like above, +are thrown when R cannot proceed with a calculation. Warnings on the +other hand usually mean that the function has run, but it probably +hasn't worked as expected. + +In both cases, the message that R prints out usually give you clues +how to fix a problem. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## R Packages + +It is possible to add functions to R by writing a package, or by +obtaining a package written by someone else. As of this writing, there +are over 10,000 packages available on CRAN (the comprehensive R archive +network). R and RStudio have functionality for managing packages: + +- You can see what packages are installed by typing + `installed.packages()` +- You can install packages by typing `install.packages("packagename")`, + where `packagename` is the package name, in quotes. 
+- You can update installed packages by typing `update.packages()` +- You can remove a package with `remove.packages("packagename")` +- You can make a package available for use with `library(packagename)` + +Packages can also be viewed, loaded, and detached in the Packages tab of the lower right panel in RStudio. Clicking on this tab will display all of the installed packages with a checkbox next to them. If the box next to a package name is checked, the package is loaded and if it is empty, the package is not loaded. Click an empty box to load that package and click a checked box to detach that package. + +Packages can be installed and updated from the Package tab with the Install and Update buttons at the top of the tab. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 2 + +What will be the value of each variable after each +statement in the following program? + +```{r, eval=FALSE} +mass <- 47.5 +age <- 122 +mass <- mass * 2.3 +age <- age - 20 +``` + +::::::::::::::: solution + +## Solution to challenge 2 + +```{r ch2pt1-sol} +mass <- 47.5 +``` + +This will give a value of `r mass` for the variable mass + +```{r ch2pt2-sol} +age <- 122 +``` + +This will give a value of `r age` for the variable age + +```{r ch2pt3-sol} +mass <- mass * 2.3 +``` + +This will multiply the existing value of `r mass/2.3` by 2.3 to give a new value of +`r mass` to the variable mass. + +```{r ch2pt4-sol} +age <- age - 20 +``` + +This will subtract 20 from the existing value of `r age + 20 ` to give a new value +of `r age` to the variable age. + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 3 + +Run the code from the previous challenge, and write a command to +compare mass to age. Is mass larger than age? 
+ +::::::::::::::: solution + +## Solution to challenge 3 + +One way of answering this question in R is to use the `>` to set up the following: + +```{r ch3-sol} +mass > age +``` + +This should yield a boolean value of TRUE since `r mass` is greater than `r age`. + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 4 + +Clean up your working environment by deleting the mass and age +variables. + +::::::::::::::: solution + +## Solution to challenge 4 + +We can use the `rm` command to accomplish this task + +```{r ch4-sol} +rm(age, mass) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 5 + +Install the following packages: `ggplot2`, `plyr`, `gapminder` + +::::::::::::::: solution + +## Solution to challenge 5 + +We can use the `install.packages()` command to install the required packages. + +```{r ch5-sol, eval=FALSE} +install.packages("ggplot2") +install.packages("plyr") +install.packages("gapminder") +``` + +An alternate solution, to install multiple packages with a single `install.packages()` command is: + +```{r ch5-sol2, eval=FALSE} +install.packages(c("ggplot2", "plyr", "gapminder")) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::: instructor + +When installing ggplot2, it may be required for some users to use the dependencies flag as a result of lazy loading affecting the install. 
This suggestion is not tied to any known bug report; it is based on instructor experience in resolving sporadic occurrences of errors encountered while delivering this workshop: + +```{r ch5-sol3, eval=FALSE} +install.packages("ggplot2", dependencies = TRUE) +``` + +::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: keypoints + +- Use RStudio to write and run R programs. +- R has the usual arithmetic operators and mathematical functions. +- Use `<-` to assign values to variables. +- Use `ls()` to list the variables in a program. +- Use `rm()` to delete objects in a program. +- Use `install.packages()` to install packages (libraries). + +:::::::::::::::::::::::::::::::::::::::::::::::::: diff --git a/locale/it/episodes/02-project-intro.Rmd b/locale/it/episodes/02-project-intro.Rmd new file mode 100644 index 000000000..74f964f40 --- /dev/null +++ b/locale/it/episodes/02-project-intro.Rmd @@ -0,0 +1,259 @@ +--- +title: Project Management With RStudio +teaching: 20 +exercises: 10 +source: Rmd +--- + +::::::::::::::::::::::::::::::::::::::: objectives + +- Create self-contained projects in RStudio + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- How can I manage my projects in R? + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +```{r, include=FALSE} +``` + +## Introduction + +The scientific process is naturally incremental, and many projects +start life as random notes, some code, then a manuscript, and +eventually everything is a bit mixed together. + + + + +Most people tend to organize their projects like this: + +![](fig/bad_layout.png){alt='Screenshot of file manager demonstrating bad project organisation'} + +There are many reasons why we should _ALWAYS_ avoid this: + +1. It is really hard to tell which version of your data is + the original and which is the modified; +2.
It gets really messy because it mixes files with various + extensions together; +3. It probably takes you a lot of time to actually find + things, and relate the correct figures to the exact code + that has been used to generate it; + +A good project layout will ultimately make your life easier: + +- It will help ensure the integrity of your data; +- It makes it simpler to share your code with someone else + (a lab-mate, collaborator, or supervisor); +- It allows you to easily upload your code with your manuscript submission; +- It makes it easier to pick the project back up after a break. + +## A possible solution + +Fortunately, there are tools and packages which can help you manage your work effectively. + +One of the most powerful and useful aspects of RStudio is its project management +functionality. We'll be using this today to create a self-contained, reproducible +project. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 1: Creating a self-contained project + +We're going to create a new project in RStudio: + +1. Click the "File" menu button, then "New Project". +2. Click "New Directory". +3. Click "New Project". +4. Type in the name of the directory to store your project, e.g. "my\_project". +5. If available, select the checkbox for "Create a git repository." +6. Click the "Create Project" button. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +The simplest way to open an RStudio project once it has been created is to click +through your file system to get to the directory where it was saved and double +click on the `.Rproj` file. This will open RStudio and start your R session in the +same directory as the `.Rproj` file. All your data, plots and scripts will now be +relative to the project directory. RStudio projects have the added benefit of +allowing you to open multiple projects at the same time each open to its own +project directory. This allows you to keep multiple projects open without them +interfering with each other. 
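As a rough sketch of what "relative to the project directory" means in practice, the snippet below builds a relative path with `file.path()` and then expands it against the current working directory. The file name `my_data.csv` is a made-up placeholder, not a file from this lesson.

```r
# Hypothetical file name, used only for illustration.
data_file <- file.path("data", "my_data.csv")
data_file

# With a project open, getwd() is the project directory, so this is
# the absolute location R would look in for the relative path above.
full_path <- file.path(getwd(), data_file)
full_path
```

Because the path is built relative to the working directory, the same script should keep working if the whole project folder is moved or shared.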
+ +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 2: Opening an RStudio project through the file system + +1. Exit RStudio. +2. Navigate to the directory where you created a project in Challenge 1. +3. Double click on the `.Rproj` file in that directory. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Best practices for project organization + +Although there is no "best" way to lay out a project, there are some general +principles to adhere to that will make project management easier: + +### Treat data as read only + +This is probably the most important goal of setting up a project. Data are +typically time-consuming and/or expensive to collect. Working with them +interactively (e.g., in Excel), where they can be modified, means you are never +sure where the data came from or how they have been modified since collection. +It is therefore a good idea to treat your data as "read-only". + +### Data Cleaning + +In many cases your data will be "dirty": it will need significant preprocessing +to get into a format R (or any other programming language) will find useful. +This task is sometimes called "data munging". Storing the cleaning scripts in a +separate folder, and creating a second "read-only" data folder to hold the +"cleaned" data sets, can prevent confusion between the two sets. + +### Treat generated output as disposable + +Anything generated by your scripts should be treated as disposable: it should +all be able to be regenerated from your scripts. + +There are lots of different ways to manage this output. Having an output folder +with different sub-directories for each separate analysis makes it easier later, +since many analyses are exploratory and don't end up being used in the final +project, and some of the analyses get shared between projects.
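One possible way to set up such an output folder is sketched below. This is only an illustration: the project name and the analysis names are hypothetical, and everything is created inside a temporary directory so that no real project is touched.

```r
# Work inside a temporary directory so nothing in a real project is changed.
project_root <- file.path(tempdir(), "my_project")

# Hypothetical analysis names; one output sub-directory per analysis.
analyses <- c("exploratory", "final_figures")
output_dirs <- file.path(project_root, "output", analyses)

# recursive = TRUE also creates the parent folders that do not exist yet.
for (d in output_dirs) {
  dir.create(d, recursive = TRUE, showWarnings = FALSE)
}

# List what was created, relative to the project root.
list.dirs(project_root, full.names = FALSE, recursive = TRUE)
```

Since everything under `output` can be regenerated from the scripts, a sub-directory for an abandoned analysis can be deleted without losing anything.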
+ +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: Good Enough Practices for Scientific Computing + +[Good Enough Practices for Scientific Computing](https://github.com/swcarpentry/good-enough-practices-in-scientific-computing/blob/gh-pages/good-enough-practices-for-scientific-computing.pdf) gives the following recommendations for project organization: + +1. Put each project in its own directory, which is named after the project. +2. Put text documents associated with the project in the `doc` directory. +3. Put raw data and metadata in the `data` directory, and files generated during cleanup and analysis in a `results` directory. +4. Put source for the project's scripts and programs in the `src` directory, and programs brought in from elsewhere or compiled locally in the `bin` directory. +5. Name all files to reflect their content or function. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +### Separate function definition and application + +One of the more effective ways to work with R is to start by writing the code you want to run directly in a .R script, and then running the selected lines (either using the keyboard shortcuts in RStudio or clicking the "Run" button) in the interactive R console. + +When your project is in its early stages, the initial .R script file usually contains many lines +of directly executed code. As it matures, reusable chunks get pulled into their +own functions. It's a good idea to separate these functions into two separate folders; one +to store useful functions that you'll reuse across analyses and projects, and +one to store the analysis scripts. + +### Save the data in the data directory + +Now we have a good directory structure we will now place/save the data file in the `data/` directory. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 3 + +Download the gapminder data from [this link to a csv file](data/gapminder_data.csv). + +1. 
Download the file (right mouse click on the link above -> "Save link as" / "Save file as", or click on the link and after the page loads, press Ctrl\+S or choose File -> "Save page as") +2. Make sure it's saved under the name `gapminder_data.csv` +3. Save the file in the `data/` folder within your project. + +We will load and inspect these data later. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 4 + +It is useful to get some general idea about the dataset, directly from the +command line, before loading it into R. Understanding the dataset better +will come in handy when making decisions on how to load it in R. Use the command-line +shell to answer the following questions: + +1. What is the size of the file? +2. How many rows of data does it contain? +3. What kinds of values are stored in this file? + +::::::::::::::: solution + +## Solution to Challenge 4 + +By running these commands in the shell: + +```{r ch2a-sol, engine="sh"} +ls -lh data/gapminder_data.csv +``` + +The file size is 80K. + +```{r ch2b-sol, engine="sh"} +wc -l data/gapminder_data.csv +``` + +There are 1705 lines. The data looks like: + +```{r ch2c-sol, engine="sh"} +head data/gapminder_data.csv +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: command line in RStudio + +The Terminal tab in the console pane provides a convenient place directly +within RStudio to interact directly with the command line. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +### Working directory + +Knowing R's current working directory is important because when you need to access other files (for example, to import a data file), R will look for them relative to the current working directory. + +Each time you create a new RStudio Project, it will create a new directory for that project. 
When you open an existing `.Rproj` file, it will open that project and set R's working directory to the folder that file is in. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 5 + +You can check the current working directory with the `getwd()` command, or by using the menus in RStudio. + +1. In the console, type `getwd()` ("wd" is short for "working directory") and hit Enter. +2. In the Files pane, double click on the `data` folder to open it (or navigate to any other folder you wish). To get the Files pane back to the current working directory, click "More" and then select "Go To Working Directory". + +You can change the working directory with `setwd()`, or by using RStudio menus. + +1. In the console, type `setwd("data")` and hit Enter. Type `getwd()` and hit Enter to see the new working directory. +2. In the menus at the top of the RStudio window, click the "Session" menu button, and then select "Set Working Directory" and then "Choose Directory". Next, in the windows navigator that opens, navigate back to the project directory, and click "Open". Note that a `setwd` command will automatically appear in the console. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: File does not exist errors + +When you're attempting to reference a file in your R code and you're getting errors saying the file doesn't exist, it's a good idea to check your working directory. +You need to either provide an absolute path to the file, or you need to make sure the file is saved in the working directory (or a subfolder of the working directory) and provide a relative path. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +### Version Control + +It is important to use version control with projects. Go [here for a good lesson which describes using Git with RStudio](https://swcarpentry.github.io/git-novice/14-supplemental-rstudio.html). 
+ +:::::::::::::::::::::::::::::::::::::::: keypoints + +- Use RStudio to create and manage projects with consistent layout. +- Treat raw data as read-only. +- Treat generated output as disposable. +- Separate function definition and application. + +:::::::::::::::::::::::::::::::::::::::::::::::::: diff --git a/locale/it/episodes/03-seeking-help.Rmd b/locale/it/episodes/03-seeking-help.Rmd new file mode 100644 index 000000000..cc2e3f7b8 --- /dev/null +++ b/locale/it/episodes/03-seeking-help.Rmd @@ -0,0 +1,267 @@ +--- +title: Seeking Help +teaching: 10 +exercises: 10 +source: Rmd +--- + +::::::::::::::::::::::::::::::::::::::: objectives + +- To be able to read R help files for functions and special operators. +- To be able to use CRAN task views to identify packages to solve a problem. +- To be able to seek help from your peers. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- How can I get help in R? + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +```{r, include=FALSE} +``` + +## Reading Help Files + +R, and every package, provide help files for functions. The general syntax to search for help on any +function, "function\_name", from a specific function that is in a package loaded into your +namespace (your interactive R session) is: + +```{r, eval=FALSE} +?function_name +help(function_name) +``` + +For example take a look at the help file for `write.table()`, we will be using a similar function in an upcoming episode. + +```{r, eval=FALSE} +?write.table() +``` + +This will load up a help page in RStudio (or as plain text in R itself). + +Each help page is broken down into sections: + +- Description: An extended description of what the function does. +- Usage: The arguments of the function and their default values (which can be changed). +- Arguments: An explanation of the data each argument is expecting. +- Details: Any important details to be aware of. 
+- Value: The data the function returns. +- See Also: Any related functions you might find useful. +- Examples: Some examples for how to use the function. + +Different functions might have different sections, but these are the main ones you should be aware of. + +Notice how related functions might call for the same help file: + +```{r, eval=FALSE} +?write.table() +?write.csv() +``` + +This is because these functions have very similar applicability and often share the same arguments as inputs to the function, so package authors often choose to document them together in a single help file. + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: Running Examples + +From within the function help page, you can highlight code in the +Examples and hit Ctrl\+Return to run it in +RStudio console. This gives you a quick way to get a feel for +how a function works. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: Reading Help Files + +One of the most daunting aspects of R is the large number of functions +available. It would be prohibitive, if not impossible to remember the +correct usage for every function you use. Luckily, using the help files +means you don't have to remember that! + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Special Operators + +To seek help on special operators, use quotes or backticks: + +```{r, eval=FALSE} +?"<-" +?`<-` +``` + +## Getting Help with Packages + +Many packages come with "vignettes": tutorials and extended example documentation. +Without any arguments, `vignette()` will list all vignettes for all installed packages; +`vignette(package="package-name")` will list all available vignettes for +`package-name`, and `vignette("vignette-name")` will open the specified vignette. + +If a package doesn't have any vignettes, you can usually find help by typing +`help("package-name")`. 
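As a small sketch of the `vignette()` calls described above, the code below lists the vignettes of a single package. The `utils` package is chosen only because it ships with R; the commented-out call assumes that your installation includes a vignette named `Sweave`.

```r
# List the vignettes shipped with one installed package.
v <- vignette(package = "utils")

# The listing is stored in a matrix; show the vignette names and titles.
v$results[, c("Item", "Title"), drop = FALSE]

# To open a vignette by name (this launches a viewer, so it is commented out):
# vignette("Sweave", package = "utils")
```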
+ +RStudio also has a set of excellent +[cheatsheets](https://rstudio.com/resources/cheatsheets/) for many packages. + +## When You Remember Part of the Function Name + +If you're not sure what package a function is in or exactly how it's spelled, you can do a fuzzy search: + +```{r, eval=FALSE} +??function_name +``` + +A fuzzy search is when you search for an approximate string match. For example, you may remember that the function +to set your working directory includes "set" in its name. You can do a fuzzy search to help you identify the function: + +```{r, eval=FALSE} +??set +``` + +## When You Have No Idea Where to Begin + +If you don't know what function or package you need to use, +[CRAN Task Views](https://cran.at.r-project.org/web/views) +is a specially maintained list of packages grouped into +fields. This can be a good starting point. + +## When Your Code Doesn't Work: Seeking Help from Your Peers + +If you're having trouble using a function, 9 times out of 10, +the question you have has already been answered on +[Stack Overflow](https://stackoverflow.com/). You can search using +the `[r]` tag. Please make sure to see their page on +[how to ask a good question.](https://stackoverflow.com/help/how-to-ask) + +If you can't find the answer, there are a few useful functions to +help you ask your peers: + +```{r, eval=FALSE} +?dput +``` + +The `dput()` function will dump the data you're working with into a format that can +be copied and pasted by others into their own R session. + +```{r} +sessionInfo() +``` + +The `sessionInfo()` function will print out your current version of R, as well as any packages you +have loaded. This can be useful for others to help reproduce and debug +your issue. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 1 + +Look at the help page for the `c` function.
What kind of vector do you +expect will be created if you evaluate the following: + +```{r, eval=FALSE} +c(1, 2, 3) +c('d', 'e', 'f') +c(1, 2, 'f') +``` + +::::::::::::::: solution + +## Solution to Challenge 1 + +The `c()` function creates a vector, in which all elements are of the +same type. In the first case, the elements are numeric, in the +second, they are characters, and in the third they are also characters: +the numeric values are "coerced" to be characters. + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 2 + +Look at the help for the `paste` function. You will need to use it later. +What's the difference between the `sep` and `collapse` arguments? + +::::::::::::::: solution + +## Solution to Challenge 2 + +To look at the help for the `paste()` function, use: + +```{r, eval=FALSE} +help("paste") +?paste +``` + +The difference between `sep` and `collapse` is a little +tricky. The `paste` function accepts any number of arguments, each of which +can be a vector of any length. The `sep` argument specifies the string +used between concatenated terms (by default, a space). The result is a +vector as long as the longest argument supplied to `paste`. In contrast, +`collapse` specifies that after concatenation the elements are _collapsed_ +together using the given separator, the result being a single string. + +It is important to call the arguments explicitly by typing out the argument +name, e.g. `sep = ","`, so the function understands to use the "," as a +separator and not as a term to concatenate. +For example: + +```{r} +paste(c("a","b"), "c") +paste(c("a","b"), "c", ",") +paste(c("a","b"), "c", sep = ",") +paste(c("a","b"), "c", collapse = "|") +paste(c("a","b"), "c", sep = ",", collapse = "|") +``` + +(For more information, +scroll to the bottom of the `?paste` help page and look at the +examples, or try `example('paste')`.)
+ +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 3 + +Use help to find a function (and its associated parameters) that you could +use to load data from a tabular file in which columns are delimited with "\\t" +(tab) and the decimal point is a "." (period). This check for decimal +separator is important, especially if you are working with international +colleagues, because different countries have different conventions for the +decimal point (i.e. comma vs period). +Hint: use `??"read table"` to look up functions related to reading in tabular data. + +::::::::::::::: solution + +## Solution to Challenge 3 + +The standard R function for reading tab-delimited files with a period +decimal separator is read.delim(). You can also do this with +`read.table(file, sep="\t")` (the period is the _default_ decimal +separator for `read.table()`), although you may have to change +the `comment.char` argument as well if your data file contains +hash (#) characters. + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Other Resources + +- [Quick R](https://www.statmethods.net/) +- [RStudio cheat sheets](https://www.rstudio.com/resources/cheatsheets/) +- [Cookbook for R](https://www.cookbook-r.com/) + +:::::::::::::::::::::::::::::::::::::::: keypoints + +- Use `help()` to get online help in R. + +:::::::::::::::::::::::::::::::::::::::::::::::::: diff --git a/locale/it/episodes/04-data-structures-part1.Rmd b/locale/it/episodes/04-data-structures-part1.Rmd new file mode 100644 index 000000000..b11c2a52c --- /dev/null +++ b/locale/it/episodes/04-data-structures-part1.Rmd @@ -0,0 +1,1101 @@ +--- +title: Data Structures +teaching: 40 +exercises: 15 +source: Rmd +--- + +::::::::::::::::::::::::::::::::::::::: objectives + +- To be able to identify the 5 main data types. 
+- To begin exploring data frames, and understand how they are related to vectors and lists. +- To be able to ask questions from R about the type, class, and structure of an object. +- To understand the information of the attributes "names", "class", and "dim". + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- How can I read data in R? +- What are the basic data types in R? +- How do I represent categorical information in R? + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +```{r, include=FALSE} +options(stringsAsFactors = FALSE) +cats_orig <- data.frame(coat = c("calico", "black", "tabby"), weight = c(2.1, 5, 3.2), likes_catnip = c(1, 0, 1), stringsAsFactors = FALSE) +cats_bad <- data.frame(coat = c("calico", "black", "tabby", "tabby"), weight = c(2.1, 5, 3.2, "2.3 or 2.4"), likes_catnip = c(1, 0, 1, 1), stringsAsFactors = FALSE) +cats <- cats_orig +``` + +One of R's most powerful features is its ability to deal with tabular data - +such as you may already have in a spreadsheet or a CSV file. Let's start by +making a toy dataset in your `data/` directory, called `feline-data.csv`: + +```{r} +cats <- data.frame(coat = c("calico", "black", "tabby"), + weight = c(2.1, 5.0, 3.2), + likes_catnip = c(1, 0, 1)) +``` + +We can now save `cats` as a CSV file. It is good practice to call the argument +names explicitly so the function knows what default values you are changing. Here we +are setting `row.names = FALSE`. Recall you can use `?write.csv` to pull +up the help file to check out the argument names and their default values. 
+ +```{r} +write.csv(x = cats, file = "data/feline-data.csv", row.names = FALSE) +``` + +The contents of the new file, `feline-data.csv`: + +```{r, eval=FALSE} +coat,weight,likes_catnip +calico,2.1,1 +black,5.0,0 +tabby,3.2,1 +``` + +::::::::::::::::::::::::::::::::::::::::: callout + +### Tip: Editing Text files in R + +Alternatively, you can create `data/feline-data.csv` using a text editor (Nano), +or within RStudio with the **File -> New File -> Text File** menu item. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +We can load this into R via the following: + +```{r} +cats <- read.csv(file = "data/feline-data.csv") +cats +``` + +The `read.table` function is used for reading in tabular data stored in a text +file where the columns of data are separated by punctuation characters such as +CSV files (csv = comma-separated values). Tabs and commas are the most common +punctuation characters used to separate or delimit data points in csv files. +For convenience R provides 2 other versions of `read.table`. These are: `read.csv` +for files where the data are separated with commas and `read.delim` for files +where the data are separated with tabs. Of these three functions `read.csv` is +the most commonly used. If needed it is possible to override the default +delimiting punctuation marks for both `read.csv` and `read.delim`. + +::::::::::::::::::::::::::::::::::::::::: callout + +### Check your data for factors + +In recent times, the default way how R handles textual data has changed. Text +data was interpreted by R automatically into a format called "factors". But +there is an easier format that is called "character". We will hear about +factors later, and what to use them for. For now, remember that in most cases, +they are not needed and only complicate your life, which is why newer R +versions read in text as "character". Check now if your version of R has +automatically created factors and convert them to "character" format: + +1. 
Check the data types of your input by typing `str(cats)` +2. In the output, look at the type abbreviations after the colons: If you see + only "num", "int" and "chr", you can continue with the lesson and skip this box. + If you find "Factor", continue to step 3. +3. Prevent R from automatically creating "factor" data. That can be done by + the following code: `options(stringsAsFactors = FALSE)`. Then, re-read + the cats table for the change to take effect. +4. You must set this option every time you restart R. To not forget this, + include it in your analysis script before you read in any data, for example + in one of the first lines. +5. From R version 4.0.0 onwards, text data is no longer converted to + factors by default. So you can install this or a newer version to avoid this + problem. If you are working on an institute or company computer, ask your + administrator to do it. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +We can begin exploring our dataset right away, pulling out columns by specifying +them using the `$` operator: + +```{r} +cats$weight +cats$coat +``` + +We can do other operations on the columns: + +```{r} +## Say we discovered that the scale weighs two kg light: +cats$weight + 2 +paste("My cat is", cats$coat) +``` + +But what about + +```{r} +cats$weight + cats$coat +``` + +Understanding what happened here is key to successfully analyzing data in R. + +### Data Types + +If you guessed that the last command will return an error because `2.1` plus +`"calico"` is nonsense, you're right - and you already have some intuition for an +important concept in programming called _data types_. We can ask what type of +data something is: + +```{r} +typeof(cats$weight) +``` + +There are 5 main types: `double`, `integer`, `complex`, `logical` and `character`. +For historic reasons, `double` is also called `numeric`.
+ +```{r} +typeof(3.14) +typeof(1L) # The L suffix forces the number to be an integer, since by default R uses float numbers +typeof(1+1i) +typeof(TRUE) +typeof('banana') +``` + +No matter how +complicated our analyses become, all data in R is interpreted as one of these +basic data types. This strictness has some really important consequences. + +A user has added details of another cat. This information is in the file +`data/feline-data_v2.csv`. + +```{r, eval=FALSE} +file.show("data/feline-data_v2.csv") +``` + +```{r, eval=FALSE} +coat,weight,likes_catnip +calico,2.1,1 +black,5.0,0 +tabby,3.2,1 +tabby,2.3 or 2.4,1 +``` + +Load the new cats data like before, and check what type of data we find in the +`weight` column: + +```{r} +cats <- read.csv(file="data/feline-data_v2.csv") +typeof(cats$weight) +``` + +Oh no, our weights aren't the double type anymore! If we try to do the same math +we did on them before, we run into trouble: + +```{r} +cats$weight + 2 +``` + +What happened? +The `cats` data we are working with is something called a _data frame_. Data frames +are one of the most common and versatile types of _data structures_ we will work with in R. +A given column in a data frame cannot be composed of different data types. +In this case, R does not read everything in the data frame column `weight` as a _double_, therefore the entire +column data type changes to something that is suitable for everything in the column. + +When R reads a csv file, it reads it in as a _data frame_. Thus, when we loaded the `cats` +csv file, it is stored as a data frame. We can recognize data frames by the first row that +is written by the `str()` function: + +```{r} +str(cats) +``` + +_Data frames_ are composed of rows and columns, where each column has the +same number of rows. 
Different columns in a data frame can be made up of different +data types (this is what makes them so versatile), but everything in a given +column needs to be the same type (e.g., vector, factor, or list). + +Let's explore more about different data structures and how they behave. +For now, let's remove that extra line from our cats data and reload it, +while we investigate this behavior further: + +feline-data.csv: + +``` +coat,weight,likes_catnip +calico,2.1,1 +black,5.0,0 +tabby,3.2,1 +``` + +And back in RStudio: + +```{r, eval=FALSE} +cats <- read.csv(file="data/feline-data.csv") +``` + +```{r, include=FALSE} +cats <- cats_orig +``` + +### Vectors and Type Coercion + +To better understand this behavior, let's meet another of the data structures: +the _vector_. + +```{r} +my_vector <- vector(length = 3) +my_vector +``` + +A vector in R is essentially an ordered list of things, with the special +condition that _everything in the vector must be the same basic data type_. If +you don't choose the datatype, it'll default to `logical`; or, you can declare +an empty vector of whatever type you like. + +```{r} +another_vector <- vector(mode='character', length=3) +another_vector +``` + +You can check if something is a vector: + +```{r} +str(another_vector) +``` + +The somewhat cryptic output from this command indicates the basic data type +found in this vector - in this case `chr`, character; an indication of the +number of things in the vector - actually, the indexes of the vector, in this +case `[1:3]`; and a few examples of what's actually in the vector - in this case +empty character strings. If we similarly do + +```{r} +str(cats$weight) +``` + +we see that `cats$weight` is a vector, too - _the columns of data we load into R +data.frames are all vectors_, and that's the root of why R forces everything in +a column to be the same basic data type. 
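To make the point about columns concrete, here is a minimal sketch that uses a small made-up data frame called `pets` (a hypothetical name, chosen so that the lesson's `cats` object is left alone). It checks that each column is a vector and reports the single basic type each one holds.

```r
# A small, made-up data frame used only for this illustration.
pets <- data.frame(coat = c("calico", "black", "tabby"),
                   weight = c(2.1, 5.0, 3.2),
                   stringsAsFactors = FALSE)

# Each column, pulled out with `$`, is a vector of a single type.
is.vector(pets$weight)
is.vector(pets$coat)

# typeof() reports the one basic data type shared by every element.
column_types <- sapply(pets, typeof)
column_types
```

Here `stringsAsFactors = FALSE` is set explicitly so that the `coat` column is a character vector regardless of the R version in use.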
+ +:::::::::::::::::::::::::::::::::::::: discussion + +### Discussion 1 + +Why is R so opinionated about what we put in our columns of data? +How does this help us? + +::::::::::::::: solution + +### Discussion 1 + +By keeping everything in a column the same, we allow ourselves to make simple +assumptions about our data; if you can interpret one entry in the column as a +number, then you can interpret _all_ of them as numbers, so we don't have to +check every time. This consistency is what people mean when they talk about +_clean data_; in the long run, strict consistency goes a long way to making +our lives easier in R. + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +#### Coercion by combining vectors + +You can also make vectors with explicit contents with the combine function: + +```{r} +combine_vector <- c(2,6,3) +combine_vector +``` + +Given what we've learned so far, what do you think the following will produce? + +```{r} +quiz_vector <- c(2,6,'3') +``` + +This is something called _type coercion_, and it is the source of many surprises +and the reason why we need to be aware of the basic data types and how R will +interpret them. When R encounters a mix of types (here double and character) to +be combined into a single vector, it will force them all to be the same +type. Consider: + +```{r} +coercion_vector <- c('a', TRUE) +coercion_vector +another_coercion_vector <- c(0, TRUE) +another_coercion_vector +``` + +#### The type hierarchy + +The coercion rules go: `logical` -> `integer` -> `double` ("`numeric`") -> +`complex` -> `character`, where -> can be read as _are transformed into_. For +example, combining `logical` and `character` transforms the result to +`character`: + +```{r} +c('a', TRUE) +``` + +A quick way to recognize `character` vectors is by the quotes that enclose them +when they are printed. 
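The coercion chain described above can be checked one step at a time. The sketch below combines two neighbouring types from the hierarchy in each call and records the type of the result; the name `step_types` is only a label for this illustration.

```r
# Each entry mixes two neighbouring types from the hierarchy;
# the combined vector takes the type further along the chain.
step_types <- c(
  logical_integer   = typeof(c(TRUE, 1L)),
  integer_double    = typeof(c(1L, 2.5)),
  double_complex    = typeof(c(2.5, 1+0i)),
  complex_character = typeof(c(1+0i, "a"))
)
step_types
```

Reading the printed values from left to right should reproduce the order of the hierarchy, from `integer` through to `character`.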
+ +You can try to force +coercion against this flow using the `as.` functions: + +```{r} +character_vector_example <- c('0','2','4') +character_vector_example +character_coerced_to_double <- as.double(character_vector_example) +character_coerced_to_double +double_coerced_to_logical <- as.logical(character_coerced_to_double) +double_coerced_to_logical +``` + +As you can see, some surprising things can happen when R forces one basic data +type into another! Nitty-gritty of type coercion aside, the point is: if your +data doesn't look like what you thought it was going to look like, type coercion +may well be to blame; make sure everything is the same type in your vectors and +your columns of data.frames, or you will get nasty surprises! + +But coercion can also be very useful! For example, in our `cats` data +`likes_catnip` is numeric, but we know that the 1s and 0s actually represent +`TRUE` and `FALSE` (a common way of representing them). We should use the +`logical` datatype here, which has two states: `TRUE` or `FALSE`, which is +exactly what our data represents. We can 'coerce' this column to be `logical` by +using the `as.logical` function: + +```{r} +cats$likes_catnip +cats$likes_catnip <- as.logical(cats$likes_catnip) +cats$likes_catnip +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +### Challenge 1 + +An important part of every data analysis is cleaning the input data. If you +know that the input data is all of the same format, (e.g. numbers), your +analysis is much easier! Clean the cat data set from the chapter about +type coercion. + +#### Copy the code template + +Create a new script in RStudio and copy and paste the following code. Then +move on to the tasks below, which help you to fill in the gaps (\_\_\_\_\_\_). + +``` +# Read data +cats <- read.csv("data/feline-data_v2.csv") + +# 1. Print the data +_____ + +# 2. Show an overview of the table with all data types +_____(cats) + +# 3. 
The "weight" column has the incorrect data type __________. +# The correct data type is: ____________. + +# 4. Correct the 4th weight data point with the mean of the two given values +cats$weight[4] <- 2.35 +# print the data again to see the effect +cats + +# 5. Convert the weight to the right data type +cats$weight <- ______________(cats$weight) + +# Calculate the mean to test yourself +mean(cats$weight) + +# If you see the correct mean value (and not NA), you did the exercise +# correctly! +``` + +### Instructions for the tasks + +#### 1\. Print the data + +Execute the first statement (`read.csv(...)`). Then print the data to the +console + +::::::::::::::: solution + +### Tip 1.1 + +Show the content of any variable by typing its name. + +### Solution to Challenge 1.1 + +Two correct solutions: + +``` +cats +print(cats) +``` + +::::::::::::::::::::::::: + +#### 2\. Overview of the data types + +The data type of your data is as important as the data itself. Use a +function we saw earlier to print out the data types of all columns of the +`cats` table. + +::::::::::::::: solution + +### Tip 1.2 + +In the chapter "Data types" we saw two functions that can show data types. +One printed just a single word, the data type name. The other printed +a short form of the data type, and the first few values. We need the second +here. + +::::::::::::::::::::::::: + +> ### Solution to Challenge 1.2 +> +> ``` +> str(cats) +> ``` + +#### 3\. Which data type do we need? + +The shown data type is not the right one for this data (weight of +a cat). Which data type do we need? + +- Why did the `read.csv()` function not choose the correct data type? +- Fill in the gap in the comment with the correct data type for cat weight! 
+ +::::::::::::::: solution + +### Tip 1.3 + +Scroll up to the section about the [type hierarchy](#the-type-hierarchy) +to review the available data types + +::::::::::::::::::::::::: + +::::::::::::::: solution + +### Solution to Challenge 1.3 + +- Weight is expressed on a continuous scale (real numbers). The R + data type for this is "double" (also known as "numeric"). +- The fourth row has the value "2.3 or 2.4". That is not a number + but two, and an english word. Therefore, the "character" data type + is chosen. The whole column is now text, because all values in the same + columns have to be the same data type. + +::::::::::::::::::::::::: + +#### 4\. Correct the problematic value + +The code to assign a new weight value to the problematic fourth row is given. +Think first and then execute it: What will be the data type after assigning +a number like in this example? +You can check the data type after executing to see if you were right. + +::::::::::::::: solution + +### Tip 1.4 + +Revisit the hierarchy of data types when two different data types are +combined. + +::::::::::::::::::::::::: + +> ### Solution to challenge 1.4 +> +> The data type of the column "weight" is "character". The assigned data +> type is "double". Combining two data types yields the data type that is +> higher in the following hierarchy: +> +> ``` +> logical < integer < double < complex < character +> ``` +> +> Therefore, the column is still of type character! We need to manually +> convert it to "double". +> {: .solution} + +#### 5\. Convert the column "weight" to the correct data type + +Cat weight are numbers. But the column does not have this data type yet. +Coerce the column to floating point numbers. + +::::::::::::::: solution + +### Tip 1.5 + +The functions to convert data types start with `as.`. You can look +for the function further up in the manuscript or use the RStudio +auto-complete function: Type "`as.`" and then press the TAB key. 
+ +::::::::::::::::::::::::: + +> ### Solution to Challenge 1.5 +> +> There are two functions that are synonymous for historic reasons: +> +> ``` +> cats$weight <- as.double(cats$weight) +> cats$weight <- as.numeric(cats$weight) +> ``` + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +### Some basic vector functions + +The combine function, `c()`, will also append things to an existing vector: + +```{r} +ab_vector <- c('a', 'b') +ab_vector +combine_example <- c(ab_vector, 'SWC') +combine_example +``` + +You can also make series of numbers: + +```{r} +mySeries <- 1:10 +mySeries +seq(10) +seq(1,10, by=0.1) +``` + +We can ask a few questions about vectors: + +```{r} +sequence_example <- 20:25 +head(sequence_example, n=2) +tail(sequence_example, n=4) +length(sequence_example) +typeof(sequence_example) +``` + +We can get individual elements of a vector by using the bracket notation: + +```{r} +first_element <- sequence_example[1] +first_element +``` + +To change a single element, use the bracket on the other side of the arrow: + +```{r} +sequence_example[1] <- 30 +sequence_example +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +### Challenge 2 + +Start by making a vector with the numbers 1 through 26. +Then, multiply the vector by 2. + +::::::::::::::: solution + +### Solution to Challenge 2 + +```{r} +x <- 1:26 +x <- x * 2 +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +### Lists + +Another data structure you'll want in your bag of tricks is the `list`. A list +is simpler in some ways than the other types, because you can put anything you +want in it. Remember _everything in the vector must be of the same basic data type_, +but a list can have different data types: + +```{r} +list_example <- list(1, "a", TRUE, 1+4i) +list_example +``` + +When printing the object structure with `str()`, we see the data types of all +elements: + +```{r} +str(list_example) +``` + +What is the use of lists? 
They can **organize data of different types**. For +example, you can organize different tables that belong together, similar to +spreadsheets in Excel. But there are many other uses, too. + +We will see another example that will maybe surprise you in the next chapter. + +To retrieve one of the elements of a list, use the **double bracket**: + +```{r} +list_example[[2]] +``` + +The elements of lists also can have **names**, they can be given by prepending +them to the values, separated by an equals sign: + +```{r} +another_list <- list(title = "Numbers", numbers = 1:10, data = TRUE ) +another_list +``` + +This results in a **named list**. Now we have a new function of our object! +We can access single elements by an additional way! + +```{r} +another_list$title +``` + +## Names + +With names, we can give meaning to elements. It is the first time that we do not +only have the **data**, but also explaining information. It is _metadata_ +that can be stuck to the object like a label. In R, this is called an +**attribute**. Some attributes enable us to do more with our +object, for example, like here, accessing an element by a self-defined name. + +### Accessing vectors and lists by name + +We have already seen how to generate a named list. The way to generate a named +vector is very similar. You have seen this function before: + +```{r} +pizza_price <- c( pizzasubito = 5.64, pizzafresh = 6.60, callapizza = 4.50 ) +``` + +The way to retrieve elements is different, though: + +```{r} +pizza_price["pizzasubito"] +``` + +The approach used for the list does not work: + +```{r} +pizza_price$pizzafresh +``` + +It will pay off if you remember this error message, you will meet it in your own +analyses. It means that you have just tried accessing an element like it was in +a list, but it is actually in a vector. 
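
The difference between the two access styles can be sketched side by side. The names `pizza_demo`, `price_list`, and `dollar_result` below are hypothetical, chosen so that the lesson's own `pizza_price` object is left untouched; `tryCatch()` is used only so the failing line does not stop the script.

```r
# Hypothetical named vector, similar in shape to pizza_price above.
pizza_demo <- c(pizzasubito = 5.64, pizzafresh = 6.60)

# Single brackets keep the name; double brackets return the bare value.
pizza_demo["pizzafresh"]
pizza_demo[["pizzafresh"]]

# Converting to a list keeps the names, and then $ works.
price_list <- as.list(pizza_demo)
price_list$pizzafresh

# On the atomic vector, $ raises an error; capture its message instead of
# letting the script stop.
dollar_result <- tryCatch(
  pizza_demo$pizzafresh,
  error = function(e) conditionMessage(e)
)
dollar_result
```

If you need `$` access, a list (or a data frame) is the structure to reach for; for a plain named vector, stick to brackets.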
+ +### Accessing and changing names + +If you are only interested in the names, use the `names()` function: + +```{r} +names(pizza_price) +``` + +We have seen how to access and change single elements of a vector. The same is +possible for names: + +```{r} +names(pizza_price)[3] +names(pizza_price)[3] <- "call-a-pizza" +pizza_price +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +### Challenge 3 + +- What is the data type of the names of `pizza_price`? You can find out + using the `str()` or `typeof()` functions. + +::::::::::::::: solution + +### Solution to Challenge 3 + +You get the names of an object by wrapping the object name inside +`names(...)`. Similarly, you get the data type of the names by again +wrapping the whole code in `typeof(...)`: + +``` +typeof(names(pizza_price)) +``` + +Alternatively, use a new variable if this is easier for you to read: + +``` +n <- names(pizza_price) +typeof(n) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +### Challenge 4 + +Instead of just changing some of the names a vector/list already has, you can +also set all names of an object by writing code like (replace ALL CAPS text): + +``` +names( OBJECT ) <- CHARACTER_VECTOR +``` + +Create a vector that gives the number for each letter in the alphabet! + +1. Generate a vector called `letter_no` with the sequence of numbers from 1 + to 26! +2. R has a built-in object called `LETTERS`. It is a 26-character vector, from + A to Z. Set the names of the number sequence to these 26 letters. +3. Test yourself by calling `letter_no["B"]`, which should give you the number + 2!
+ +::::::::::::::: solution + +### Solution to Challenge 4 + +``` +letter_no <- 1:26 # or seq(1,26) +names(letter_no) <- LETTERS +letter_no["B"] +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Data frames + +We met data frames at the very beginning of this lesson; they represent +a table of data. We didn't go much further into detail with our example cat +data frame: + +```{r} +cats +``` + +We can now understand something a bit surprising in our data.frame; what happens +if we run: + +```{r} +typeof(cats) +``` + +We see that data.frames look like lists 'under the hood'. Think back to what we +learned about what lists can be used for: + +> Lists organize data of different types + +Columns of a data frame are vectors of different types that are organized +by belonging to the same table. + +A data.frame is really a list of vectors. It is a special list in which all the +vectors must have the same length. + +How is this "special"-ness written into the object, so that R does not treat it +like any other list, but as a table? + +```{r} +class(cats) +``` + +A **class**, just like names, is an attribute attached to the object. It tells +us what this object means for humans. + +You might wonder: why do we need another what-type-of-object-is-this function +when we already have `typeof()`? That function tells us how the object is +**constructed in the computer**. The `class` is the **meaning of the object for +humans**. Consequently, what `typeof()` returns is _fixed_ in R (mainly the +five data types), whereas the output of `class()` is _diverse_ and _extendable_ +by R packages. + +In our `cats` example, we have a character, a double and a logical variable. As +we have seen already, each column of a data.frame is a vector. + +```{r} +cats$coat +cats[,1] +typeof(cats[,1]) +str(cats[,1]) +``` + +Each row is an _observation_ of different variables, itself a data.frame, and +thus can be composed of elements of different types.
+ +```{r} +cats[1,] +typeof(cats[1,]) +str(cats[1,]) +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +### Challenge 5 + +There are several subtly different ways to call variables, observations and +elements from data.frames: + +- `cats[1]` +- `cats[[1]]` +- `cats$coat` +- `cats["coat"]` +- `cats[1, 1]` +- `cats[, 1]` +- `cats[1, ]` + +Try out these examples and explain what is returned by each one. + +_Hint:_ Use the function `typeof()` to examine what is returned in each case. + +::::::::::::::: solution + +### Solution to Challenge 5 + +```{r, eval=TRUE, echo=TRUE} +cats[1] +``` + +We can think of a data frame as a list of vectors. The single brace `[1]` +returns the first slice of the list, as another list. In this case it is the +first column of the data frame. + +```{r, eval=TRUE, echo=TRUE} +cats[[1]] +``` + +The double brace `[[1]]` returns the contents of the list item. In this case +it is the contents of the first column, a _vector_ of type _character_. + +```{r, eval=TRUE, echo=TRUE} +cats$coat +``` + +This example uses the `$` character to address items by name. _coat_ is the +first column of the data frame, again a _vector_ of type _character_. + +```{r, eval=TRUE, echo=TRUE} +cats["coat"] +``` + +Here we are using a single brace `["coat"]` replacing the index number with +the column name. Like example 1, the returned object is a _list_. + +```{r, eval=TRUE, echo=TRUE} +cats[1, 1] +``` + +This example uses a single brace, but this time we provide row and column +coordinates. The returned object is the value in row 1, column 1. The object +is a _vector_ of type _character_. + +```{r, eval=TRUE, echo=TRUE} +cats[, 1] +``` + +Like the previous example we use single braces and provide row and column +coordinates. The row coordinate is not specified, R interprets this missing +value as all the elements in this _column_ and returns them as a _vector_. 
+ +```{r, eval=TRUE, echo=TRUE} +cats[1, ] +``` + +Again we use the single brace with row and column coordinates. The column +coordinate is not specified. The return value is a _list_ containing all the +values in the first row. + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::::: callout + +### Tip: Renaming data frame columns + +Data frames have column names, which can be accessed with the `names()` function. + +```{r} +names(cats) +``` + +If you want to rename the second column of `cats`, you can assign a new name to the second element of `names(cats)`. + +```{r} +names(cats)[2] <- "weight_kg" +cats +``` + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +```{r, include=FALSE} +# reverting cats back to original version +cats <- cats_orig +``` + +### Matrices + +Last but not least is the matrix. We can declare a matrix full of zeros: + +```{r} +matrix_example <- matrix(0, ncol=6, nrow=3) +matrix_example +``` + +What makes it special is the `dim()` attribute: + +```{r} +dim(matrix_example) +``` + +And similar to other data structures, we can ask things about our matrix: + +```{r} +typeof(matrix_example) +class(matrix_example) +str(matrix_example) +nrow(matrix_example) +ncol(matrix_example) +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +### Challenge 6 + +What do you think will be the result of +`length(matrix_example)`? +Try it. +Were you right? Why / why not? + +::::::::::::::: solution + +### Solution to Challenge 6 + +What do you think will be the result of +`length(matrix_example)`? + +```{r} +matrix_example <- matrix(0, ncol=6, nrow=3) +length(matrix_example) +``` + +Because a matrix is a vector with added dimension attributes, `length` +gives you the total number of elements in the matrix. 
+ +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +### Challenge 7 + +Make another matrix, this time containing the numbers 1:50, +with 5 columns and 10 rows. +Did the `matrix` function fill your matrix by column, or by +row, as its default behaviour? +See if you can figure out how to change this. +(hint: read the documentation for `matrix`!) + +::::::::::::::: solution + +### Solution to Challenge 7 + +Make another matrix, this time containing the numbers 1:50, +with 5 columns and 10 rows. +Did the `matrix` function fill your matrix by column, or by +row, as its default behaviour? +See if you can figure out how to change this. +(hint: read the documentation for `matrix`!) + +```{r, eval=FALSE} +x <- matrix(1:50, ncol=5, nrow=10) +x <- matrix(1:50, ncol=5, nrow=10, byrow = TRUE) # to fill by row +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +### Challenge 8 + +Create a list of length two containing a character vector for each of the sections in this part of the workshop: + +- Data types +- Data structures + +Populate each character vector with the names of the data types and data +structures we've seen so far. + +::::::::::::::: solution + +### Solution to Challenge 8 + +```{r} +dataTypes <- c('double', 'complex', 'integer', 'character', 'logical') +dataStructures <- c('data.frame', 'vector', 'list', 'matrix') +answer <- list(dataTypes, dataStructures) +``` + +Note: it's nice to make a list in big writing on the board or taped to the wall +listing all of these types and structures - leave it up for the rest of the workshop +to remind people of the importance of these basics. 
+ +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +### Challenge 9 + +Consider the R output of the matrix below: + +```{r, echo=FALSE} +matrix(c(4, 1, 9, 5, 10, 7), ncol = 2, byrow = TRUE) +``` + +What was the correct command used to write this matrix? Examine +each command and try to figure out the correct one before typing them. +Think about what matrices the other commands will produce. + +1. `matrix(c(4, 1, 9, 5, 10, 7), nrow = 3)` +2. `matrix(c(4, 9, 10, 1, 5, 7), ncol = 2, byrow = TRUE)` +3. `matrix(c(4, 9, 10, 1, 5, 7), nrow = 2)` +4. `matrix(c(4, 1, 9, 5, 10, 7), ncol = 2, byrow = TRUE)` + +::::::::::::::: solution + +### Solution to Challenge 9 + +Consider the R output of the matrix below: + +```{r, echo=FALSE} +matrix(c(4, 1, 9, 5, 10, 7), ncol = 2, byrow = TRUE) +``` + +What was the correct command used to write this matrix? Examine +each command and try to figure out the correct one before typing them. +Think about what matrices the other commands will produce. + +```{r, eval=FALSE} +matrix(c(4, 1, 9, 5, 10, 7), ncol = 2, byrow = TRUE) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: keypoints + +- Use `read.csv` to read tabular data in R. +- The basic data types in R are double, integer, complex, logical, and character. +- Data structures such as data frames or matrices are built on top of lists and vectors, with some added attributes. 
+ +:::::::::::::::::::::::::::::::::::::::::::::::::: diff --git a/locale/it/episodes/05-data-structures-part2.Rmd b/locale/it/episodes/05-data-structures-part2.Rmd new file mode 100644 index 000000000..abc4d714a --- /dev/null +++ b/locale/it/episodes/05-data-structures-part2.Rmd @@ -0,0 +1,395 @@ +--- +title: Exploring Data Frames +teaching: 20 +exercises: 10 +source: Rmd +--- + +::::::::::::::::::::::::::::::::::::::: objectives + +- Add and remove rows or columns. +- Append two data frames. +- Display basic properties of data frames including size and class of the columns, names, and first few rows. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- How can I manipulate a data frame? + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +```{r, include=FALSE} +``` + +At this point, you've seen it all: in the last lesson, we toured all the basic +data types and data structures in R. Everything you do will be a manipulation of +those tools. But most of the time, the star of the show is the data frame—the table that we created by loading information from a csv file. In this lesson, we'll learn a few more things +about working with data frames. + +## Adding columns and rows in data frames + +We already learned that the columns of a data frame are vectors, so that our +data are consistent in type throughout the columns. As such, if we want to add a +new column, we can start by making a new vector: + +```{r, echo=FALSE} +cats <- read.csv("data/feline-data.csv") +``` + +```{r} +age <- c(2, 3, 5) +cats +``` + +We can then add this as a column via: + +```{r} +cbind(cats, age) +``` + +Note that if we tried to add a vector of ages with a different number of entries than the number of rows in the data frame, it would fail: + +```{r, error=TRUE} +age <- c(2, 3, 5, 12) +cbind(cats, age) + +age <- c(2, 3) +cbind(cats, age) +``` + +Why didn't this work? 
Of course, R wants to see one element in our new column +for every row in the table: + +```{r} +nrow(cats) +length(age) +``` + +So for it to work we need to have `nrow(cats)` = `length(age)`. Let's overwrite the content of cats with our new data frame. + +```{r} +age <- c(2, 3, 5) +cats <- cbind(cats, age) +``` + +Now how about adding rows? We already know that the rows of a +data frame are lists: + +```{r} +newRow <- list("tortoiseshell", 3.3, TRUE, 9) +cats <- rbind(cats, newRow) +``` + +Let's confirm that our new row was added correctly. + +```{r} +cats +``` + +## Removing rows + +We now know how to add rows and columns to our data frame in R. Now let's learn to remove rows. + +```{r} +cats +``` + +We can ask for a data frame minus the last row: + +```{r} +cats[-4, ] +``` + +Notice the comma with nothing after it to indicate that we want to drop the entire fourth row. + +Note: we could also remove several rows at once by putting the row numbers +inside of a vector, for example: `cats[c(-3,-4), ]` + +## Removing columns + +We can also remove columns in our data frame. What if we want to remove the column "age"? We can remove it in two ways: by column number or by column name. + +```{r} +cats[,-4] +``` + +Notice the comma with nothing before it, indicating we want to keep all of the rows. + +Alternatively, we can drop the column by using its name and the `%in%` operator. The `%in%` operator goes through each element of its left argument, in this case the names of `cats`, and asks, "Does this element occur in the second argument?" + +```{r} +drop <- names(cats) %in% c("age") +cats[,!drop] +``` + +We will cover subsetting with logical operators like `%in%` in more detail in the next episode.
See the section [Subsetting through other logical operations](06-data-subsetting.Rmd) + +## Appending to a data frame + +The key thing to remember when adding data to a data frame is that _columns are +vectors and rows are lists._ We can also glue two data frames +together with `rbind`: + +```{r} +cats <- rbind(cats, cats) +cats +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 1 + +You can create a new data frame right from within R with the following syntax: + +```{r} +df <- data.frame(id = c("a", "b", "c"), + x = 1:3, + y = c(TRUE, TRUE, FALSE)) +``` + +Make a data frame that holds the following information for yourself: + +- first name +- last name +- lucky number + +Then use `rbind` to add an entry for the people sitting beside you. +Finally, use `cbind` to add a column with each person's answer to the question, "Is it time for coffee break?" + +::::::::::::::: solution + +## Solution to Challenge 1 + +```{r} +df <- data.frame(first = c("Grace"), + last = c("Hopper"), + lucky_number = c(0)) +df <- rbind(df, list("Marie", "Curie", 238) ) +df <- cbind(df, coffeetime = c(TRUE,TRUE)) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Realistic example + +So far, you have seen the basics of manipulating data frames with our cat data; +now let's use those skills to digest a more realistic dataset. Let's read in the +`gapminder` dataset that we downloaded previously: + +```{r} +gapminder <- read.csv("data/gapminder_data.csv") +``` + +::::::::::::::::::::::::::::::::::::::::: callout + +## Miscellaneous Tips + +- Another type of file you might encounter is the tab-separated value file (.tsv). To specify a tab as the separator, pass `sep = "\t"` to `read.csv()`, or use `read.delim()`. + +- Files can also be downloaded directly from the Internet into a local + folder of your choice onto your computer using the `download.file` function.
+ The `read.csv` function can then be executed to read the downloaded file from the download location, for example, + +```{r, eval=FALSE, echo=TRUE} +download.file("https://raw.githubusercontent.com/swcarpentry/r-novice-gapminder/main/episodes/data/gapminder_data.csv", destfile = "data/gapminder_data.csv") +gapminder <- read.csv("data/gapminder_data.csv") +``` + +- Alternatively, you can also read in files directly into R from the Internet by replacing the file paths with a web address in `read.csv`. One should note that in doing this no local copy of the csv file is first saved onto your computer. For example, + +```{r, eval=FALSE, echo=TRUE} +gapminder <- read.csv("https://raw.githubusercontent.com/swcarpentry/r-novice-gapminder/main/episodes/data/gapminder_data.csv") +``` + +- You can read directly from excel spreadsheets without + converting them to plain text first by using the [readxl](https://cran.r-project.org/package=readxl) package. + +- The argument "stringsAsFactors" can be useful to tell R how to read strings either as factors or as character strings. In R versions after 4.0, all strings are read-in as characters by default, but in earlier versions of R, strings are read-in as factors by default. For more information, see the call-out in [the previous episode](https://swcarpentry.github.io/r-novice-gapminder/04-data-structures-part1.html#check-your-data-for-factors). + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +Let's investigate gapminder a bit; the first thing we should always do is check +out what the data looks like with `str`: + +```{r} +str(gapminder) +``` + +An additional method for examining the structure of gapminder is to use the `summary` function. This function can be used on various objects in R. For data frames, `summary` yields a numeric, tabular, or descriptive summary of each column. 
Numeric or integer columns are described by the descriptive statistics (quartiles and mean), and character columns by its length, class, and mode. + +```{r} +summary(gapminder) +``` + +Along with the `str` and `summary` functions, we can examine individual columns of the data frame with our `typeof` function: + +```{r} +typeof(gapminder$year) +typeof(gapminder$country) +str(gapminder$country) +``` + +We can also interrogate the data frame for information about its dimensions; +remembering that `str(gapminder)` said there were 1704 observations of 6 +variables in gapminder, what do you think the following will produce, and why? + +```{r} +length(gapminder) +``` + +A fair guess would have been to say that the length of a data frame would be the +number of rows it has (1704), but this is not the case; remember, a data frame +is a _list of vectors and factors_: + +```{r} +typeof(gapminder) +``` + +When `length` gave us 6, it's because gapminder is built out of a list of 6 +columns. To get the number of rows and columns in our dataset, try: + +```{r} +nrow(gapminder) +ncol(gapminder) +``` + +Or, both at once: + +```{r} +dim(gapminder) +``` + +We'll also likely want to know what the titles of all the columns are, so we can +ask for them later: + +```{r} +colnames(gapminder) +``` + +At this stage, it's important to ask ourselves if the structure R is reporting +matches our intuition or expectations; do the basic data types reported for each +column make sense? If not, we need to sort any problems out now before they turn +into bad surprises down the road, using what we've learned about how R +interprets data, and the importance of _strict consistency_ in how we record our +data. + +Once we're happy that the data types and structures seem reasonable, it's time +to start digging into our data proper. 
Check out the first few lines: + +```{r} +head(gapminder) +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 2 + +It's good practice to also check the last few lines of your data and some in the middle. How would you do this? + +Searching for ones specifically in the middle isn't too hard, but we could ask for a few lines at random. How would you code this? + +::::::::::::::: solution + +## Solution to Challenge 2 + +To check the last few lines it's relatively simple as R already has a function for this: + +```r +tail(gapminder) +tail(gapminder, n = 15) +``` + +What about a few arbitrary rows just in case something is odd in the middle? + +## Tip: There are several ways to achieve this. + +The solution here presents one form of using nested functions, i.e. a function passed as an argument to another function. This might sound like a new concept, but you are already using it! +Remember my\_dataframe[rows, cols] will print to screen your data frame with the number of rows and columns you asked for (although you might have asked for a range or named columns for example). How would you get the last row if you don't know how many rows your data frame has? R has a function for this. What about getting a (pseudorandom) sample? R also has a function for this. + +```r +gapminder[sample(nrow(gapminder), 5), ] +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +To make sure our analysis is reproducible, we should put the code +into a script file so we can come back to it later. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 3 + +Go to file -> new file -> R script, and write an R script +to load in the gapminder dataset. Put it in the `scripts/` +directory and add it to version control. + +Run the script using the `source` function, using the file path +as its argument (or by pressing the "source" button in RStudio). 
+ +::::::::::::::: solution + +## Solution to Challenge 3 + +The `source` function can be used to use a script within a script. +Assume you would like to load the same type of file over and over +again and therefore you need to specify the arguments to fit the +needs of your file. Instead of writing the necessary argument again +and again you could just write it once and save it as a script. Then, +you can use `source("Your_Script_containing_the_load_function")` in a new +script to use the function of that script without writing everything again. +Check out `?source` to find out more. + +```{r, eval=FALSE} +download.file("https://raw.githubusercontent.com/swcarpentry/r-novice-gapminder/main/episodes/data/gapminder_data.csv", destfile = "data/gapminder_data.csv") +gapminder <- read.csv(file = "data/gapminder_data.csv") +``` + +To run the script and load the data into the `gapminder` variable: + +```{r, eval=FALSE} +source(file = "scripts/load-gapminder.R") +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 4 + +Read the output of `str(gapminder)` again; +this time, use what you've learned about lists and vectors, +as well as the output of functions like `colnames` and `dim` +to explain what everything that `str` prints out for gapminder means. +If there are any parts you can't interpret, discuss with your neighbors! + +::::::::::::::: solution + +## Solution to Challenge 4 + +The object `gapminder` is a data frame with columns + +- `country` and `continent` are character strings. +- `year` is an integer vector. +- `pop`, `lifeExp`, and `gdpPercap` are numeric vectors. + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: keypoints + +- Use `cbind()` to add a new column to a data frame. +- Use `rbind()` to add a new row to a data frame. +- Remove rows from a data frame. 
+- Use `str()`, `summary()`, `nrow()`, `ncol()`, `dim()`, `colnames()`, `head()`, and `typeof()` to understand the structure of a data frame. +- Read in a csv file using `read.csv()`. +- Understand what `length()` of a data frame represents. + +:::::::::::::::::::::::::::::::::::::::::::::::::: diff --git a/locale/it/episodes/06-data-subsetting.Rmd b/locale/it/episodes/06-data-subsetting.Rmd new file mode 100644 index 000000000..23242457e --- /dev/null +++ b/locale/it/episodes/06-data-subsetting.Rmd @@ -0,0 +1,863 @@ +--- +title: Subsetting Data +teaching: 35 +exercises: 15 +source: Rmd +--- + +::::::::::::::::::::::::::::::::::::::: objectives + +- To be able to subset vectors, factors, matrices, lists, and data frames +- To be able to extract individual and multiple elements: by index, by name, using comparison operations +- To be able to skip and remove elements from various data structures. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- How can I work with subsets of data in R? + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +```{r, include=FALSE} +gapminder <- read.csv("data/gapminder_data.csv", header = TRUE) +``` + +R has many powerful subset operators. Mastering them will allow you to +easily perform complex operations on any kind of dataset. + +There are six different ways we can subset any kind of object, and three +different subsetting operators for the different data structures. + +Let's start with the workhorse of R: a simple numeric vector. + +```{r} +x <- c(5.4, 6.2, 7.1, 4.8, 7.5) +names(x) <- c('a', 'b', 'c', 'd', 'e') +x +``` + +::::::::::::::::::::::::::::::::::::::::: callout + +## Atomic vectors + +In R, simple vectors containing character strings, numbers, or logical values are called _atomic_ vectors because they can't be further simplified. 
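As a quick check, base R's `is.atomic()` reports whether an object is an atomic vector. The sketch below uses made-up vectors (`chr`, `num`, and `lgl` are names invented for this illustration) and contrasts them with a list, which can mix types and is therefore not atomic:

```r
chr <- c("a", "b")        # character vector (hypothetical example data)
num <- c(1.5, 2)          # numeric (double) vector
lgl <- c(TRUE, FALSE)     # logical vector

is.atomic(chr)
is.atomic(num)
is.atomic(lgl)

# A list can hold elements of different types, so it is not atomic
is.atomic(list(1, "a"))
```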
+ +:::::::::::::::::::::::::::::::::::::::::::::::::: + +So now that we've created a dummy vector to play with, how do we get at its +contents? + +## Accessing elements using their indices + +To extract elements of a vector we can give their corresponding index, starting +from one: + +```{r} +x[1] +``` + +```{r} +x[4] +``` + +It may look different, but the square brackets operator is a function. For vectors +(and matrices), it means "get me the nth element". + +We can ask for multiple elements at once: + +```{r} +x[c(1, 3)] +``` + +Or slices of the vector: + +```{r} +x[1:4] +``` + +the `:` operator creates a sequence of numbers from the left element to the right. + +```{r} +1:4 +c(1, 2, 3, 4) +``` + +We can ask for the same element multiple times: + +```{r} +x[c(1,1,3)] +``` + +If we ask for an index beyond the length of the vector, R will return a missing value: + +```{r} +x[6] +``` + +This is a vector of length one containing an `NA`, whose name is also `NA`. + +If we ask for the 0th element, we get an empty vector: + +```{r} +x[0] +``` + +::::::::::::::::::::::::::::::::::::::::: callout + +## Vector numbering in R starts at 1 + +In many programming languages (C and Python, for example), the first +element of a vector has an index of 0. In R, the first element is 1. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Skipping and removing elements + +If we use a negative number as the index of a vector, R will return +every element _except_ for the one specified: + +```{r} +x[-2] +``` + +We can skip multiple elements: + +```{r} +x[c(-1, -5)] # or x[-c(1,5)] +``` + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: Order of operations + +A common trip up for novices occurs when trying to skip +slices of a vector. It's natural to try to negate a +sequence like so: + +```{r, error=TRUE, eval=FALSE} +x[-1:3] +``` + +This gives a somewhat cryptic error: + +```{r, error=TRUE, echo=FALSE} +x[-1:3] +``` + +But remember the order of operations. 
`:` is really a function. +It takes its first argument as -1, and its second as 3, +so generates the sequence of numbers: `c(-1, 0, 1, 2, 3)`. + +The correct solution is to wrap that function call in brackets, so +that the `-` operator applies to the result: + +```{r} +x[-(1:3)] +``` + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +To remove elements from a vector, we need to assign the result back +into the variable: + +```{r} +x <- x[-4] +x +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 1 + +Given the following code: + +```{r} +x <- c(5.4, 6.2, 7.1, 4.8, 7.5) +names(x) <- c('a', 'b', 'c', 'd', 'e') +print(x) +``` + +Come up with at least 2 different commands that will produce the following output: + +```{r, echo=FALSE} +x[2:4] +``` + +After you find 2 different commands, compare notes with your neighbour. Did you have different strategies? + +::::::::::::::: solution + +## Solution to challenge 1 + +```{r} +x[2:4] +``` + +```{r} +x[-c(1,5)] +``` + +```{r} +x[c(2,3,4)] +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Subsetting by name + +We can extract elements by using their name, instead of extracting by index: + +```{r} +x <- c(a=5.4, b=6.2, c=7.1, d=4.8, e=7.5) # we can name a vector 'on the fly' +x[c("a", "c")] +``` + +This is usually a much more reliable way to subset objects: the +position of various elements can often change when chaining together +subsetting operations, but the names will always remain the same! + +## Subsetting through other logical operations {#logical-operations} + +We can also use any logical vector to subset: + +```{r} +x[c(FALSE, FALSE, TRUE, FALSE, TRUE)] +``` + +Since comparison operators (e.g. `>`, `<`, `==`) evaluate to logical vectors, we can also +use them to succinctly subset vectors: the following statement gives +the same result as the previous one. 
+ +```{r} +x[x > 7] +``` + +Breaking it down, this statement first evaluates `x>7`, generating +a logical vector `c(FALSE, FALSE, TRUE, FALSE, TRUE)`, and then +selects the elements of `x` corresponding to the `TRUE` values. + +We can use `==` to mimic the previous method of indexing by name +(remember you have to use `==` rather than `=` for comparisons): + +```{r} +x[names(x) == "a"] +``` + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: Combining logical conditions + +We often want to combine multiple logical +criteria. For example, we might want to find all the countries that are +located in Asia **or** Europe **and** have life expectancies within a certain +range. Several operations for combining logical vectors exist in R: + +- `&`, the "logical AND" operator: returns `TRUE` if both the left and right + are `TRUE`. +- `|`, the "logical OR" operator: returns `TRUE` if either the left or right + (or both) are `TRUE`. + +You may sometimes see `&&` and `||` instead of `&` and `|`. These two-character operators +are meant for single (length-one) values: older versions of R only looked at the first element of each vector and ignored the +remaining elements, and R 4.3.0 and later raise an error instead. In general you should not use the two-character +operators in data analysis; save them +for programming, i.e. deciding whether to execute a statement. + +- `!`, the "logical NOT" operator: converts `TRUE` to `FALSE` and `FALSE` to + `TRUE`. It can negate a single logical condition (e.g. `!TRUE` becomes + `FALSE`), or a whole vector of conditions (e.g. `!c(TRUE, FALSE)` becomes + `c(FALSE, TRUE)`). + +Additionally, you can compare the elements within a single vector using the +`all` function (which returns `TRUE` if every element of the vector is `TRUE`) +and the `any` function (which returns `TRUE` if one or more elements of the +vector are `TRUE`). 
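The operators described in this tip can be tried on the small named vector used earlier in this episode. This is only a sketch; the thresholds (4, 5, and 7) are arbitrary values chosen for illustration:

```r
x <- c(a = 5.4, b = 6.2, c = 7.1, d = 4.8, e = 7.5)

# element-wise AND: values greater than 5 and less than 7
x[x > 5 & x < 7]

# element-wise OR: values less than 5 or greater than 7
x[x < 5 | x > 7]

# NOT: every value that is not greater than 7
x[!(x > 7)]

# collapse a whole logical vector into a single TRUE or FALSE
any(x > 7)
all(x > 4)
```

Note that `&`, `|`, and `!` return one logical value per element, which is why they can be used inside `[`, whereas `any()` and `all()` return a single value.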
+ +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 2 + +Given the following code: + +```{r} +x <- c(5.4, 6.2, 7.1, 4.8, 7.5) +names(x) <- c('a', 'b', 'c', 'd', 'e') +print(x) +``` + +Write a subsetting command to return the values in x that are greater than 4 and less than 7. + +::::::::::::::: solution + +## Solution to challenge 2 + +```{r} +x_subset <- x[x<7 & x>4] +print(x_subset) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: Non-unique names + +You should be aware that it is possible for multiple elements in a +vector to have the same name. (For a data frame, columns can have +the same name --- although R tries to avoid this --- but row names +must be unique.) Consider these examples: + +```{r} +x <- 1:3 +x +names(x) <- c('a', 'a', 'a') +x +x['a'] # only returns first value +x[names(x) == 'a'] # returns all three values +``` + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: Getting help for operators + +Remember you can search for help on operators by wrapping them in quotes: +`help("%in%")` or `?"%in%"`. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Skipping named elements + +Skipping or removing named elements is a little harder. If we try to skip one named element by negating the string, R complains (slightly obscurely) that it doesn't know how to take the negative of a string: + +```{r} +x <- c(a=5.4, b=6.2, c=7.1, d=4.8, e=7.5) # we start again by naming a vector 'on the fly' +x[-"a"] +``` + +However, we can use the `!=` (not-equals) operator to construct a logical vector that will do what we want: + +```{r} +x[names(x) != "a"] +``` + +Skipping multiple named indices is a little bit harder still. 
Suppose we want to drop the `"a"` and `"c"` elements, so we try this: + +```{r} +x[names(x)!=c("a","c")] +``` + +R did _something_, but it gave us a warning that we ought to pay attention to - and it apparently _gave us the wrong answer_ (the `"c"` element is still included in the vector)! + +So what does `!=` actually do in this case? That's an excellent question. + +### Recycling + +Let's take a look at the comparison component of this code: + +```{r} +names(x) != c("a", "c") +``` + +Why does R give `TRUE` as the third element of this vector, when `names(x)[3] != "c"` is obviously false? +When you use `!=`, R tries to compare each element +of the left argument with the corresponding element of its right +argument. What happens when you compare vectors of different lengths? + +![](fig/06-rmd-inequality.1.png){alt='Inequality testing'} + +When one vector is shorter than the other, it gets _recycled_: + +![](fig/06-rmd-inequality.2.png){alt='Inequality testing: results of recycling'} + +In this case R **repeats** `c("a", "c")` as many times as necessary to match `names(x)`, i.e. we get `c("a","c","a","c","a")`. Since the recycled `"a"` +doesn't match the third element of `names(x)`, the value of `!=` is `TRUE`. +Because in this case the longer vector length (5) isn't a multiple of the shorter vector length (2), R printed a warning message. If we had been unlucky and `names(x)` had contained six elements, R would _silently_ have done the wrong thing (i.e., not what we intended it to do). This recycling rule can introduce hard-to-find and subtle bugs! + +The way to get R to do what we really want (match _each_ element of the left argument with _all_ of the elements of the right argument) is to use the `%in%` operator. The `%in%` operator goes through each element of its left argument, in this case the names of `x`, and asks, "Does this element occur in the second argument?". 
Here, since we want to _exclude_ values, we also need a `!` operator to change "in" to "not in": + +```{r} +x[! names(x) %in% c("a","c") ] +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 3 + +Selecting elements of a vector that match any of a list of components +is a very common data analysis task. For example, the gapminder data set +contains `country` and `continent` variables, but no information between +these two scales. Suppose we want to pull out information from southeast +Asia: how do we set up an operation to produce a logical vector that +is `TRUE` for all of the countries in southeast Asia and `FALSE` otherwise? + +Suppose you have these data: + +```{r} +seAsia <- c("Myanmar","Thailand","Cambodia","Vietnam","Laos") +## read in the gapminder data that we downloaded in episode 2 +gapminder <- read.csv("data/gapminder_data.csv", header=TRUE) +## extract the `country` column from a data frame (we'll see this later); +## convert from a factor to a character; +## and get just the non-repeated elements +countries <- unique(as.character(gapminder$country)) +``` + +There's a wrong way (using only `==`), which will give you a warning; +a clunky way (using the logical operators `==` and `|`); and +an elegant way (using `%in%`). See whether you can come up with all three +and explain how they (don't) work. + +::::::::::::::: solution + +## Solution to challenge 3 + +- The **wrong** way to do this problem is `countries==seAsia`. This + gives a warning (`"In countries == seAsia : longer object length is not a multiple of shorter object length"`) and the wrong answer (a vector of all + `FALSE` values), because none of the recycled values of `seAsia` happen + to line up correctly with matching values in `country`. 
+- The **clunky** (but technically correct) way to do this problem is + +```{r, results="hide"} + (countries=="Myanmar" | countries=="Thailand" | + countries=="Cambodia" | countries == "Vietnam" | countries=="Laos") +``` + +(or `countries==seAsia[1] | countries==seAsia[2] | ...`). This +gives the correct values, but hopefully you can see how awkward it +is (what if we wanted to select countries from a much longer list?). + +- The best way to do this problem is `countries %in% seAsia`, which + is both correct and easy to type (and read). + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Handling special values + +At some point you will encounter functions in R that cannot handle missing, infinite, +or undefined data. + +There are a number of special functions you can use to filter out this data: + +- `is.na` will return all positions in a vector, matrix, or data.frame + containing `NA` (or `NaN`) +- likewise, `is.nan`, and `is.infinite` will do the same for `NaN` and `Inf`. +- `is.finite` will return all positions in a vector, matrix, or data.frame + that do not contain `NA`, `NaN` or `Inf`. +- `na.omit` will filter out all missing values from a vector + +## Factor subsetting + +Now that we've explored the different ways to subset vectors, how +do we subset the other data structures? + +Factor subsetting works the same way as vector subsetting. + +```{r} +f <- factor(c("a", "a", "b", "c", "c", "d")) +f[f == "a"] +f[f %in% c("b", "c")] +f[1:3] +``` + +Skipping elements will not remove the level +even if no more of that category exists in the factor: + +```{r} +f[-3] +``` + +## Matrix subsetting + +Matrices are also subsetted using the `[` function. 
In this case +it takes two arguments: the first applying to the rows, the second +to the columns: + +```{r} +set.seed(1) +m <- matrix(rnorm(6*4), ncol=4, nrow=6) +m[3:4, c(3,1)] +``` + +You can leave the first or second arguments blank to retrieve all the +rows or columns respectively: + +```{r} +m[, c(3,4)] +``` + +If we only access one row or column, R will automatically convert the result +to a vector: + +```{r} +m[3,] +``` + +If you want to keep the output as a matrix, you need to specify a _third_ argument, +`drop = FALSE`: + +```{r} +m[3, , drop=FALSE] +``` + +Unlike vectors, if we try to access a row or column outside of the matrix, +R will throw an error: + +```{r, error=TRUE} +m[, c(3,6)] +``` + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: Higher dimensional arrays + +When dealing with multi-dimensional arrays, each argument to `[` +corresponds to a dimension. For example, in a 3D array, the first three +arguments correspond to the rows, columns, and depth dimensions. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +Because matrices are vectors, we can +also subset using only one argument: + +```{r} +m[5] +``` + +This usually isn't useful, and is often confusing to read. However, it is useful to note that matrices +are laid out in _column-major format_ by default. That is, the elements of the +vector are arranged column-wise: + +```{r} +matrix(1:6, nrow=2, ncol=3) +``` + +If you wish to populate the matrix by row, use `byrow=TRUE`: + +```{r} +matrix(1:6, nrow=2, ncol=3, byrow=TRUE) +``` + +Matrices can also be subsetted using their row names and column names +instead of their row and column indices. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 4 + +Given the following code: + +```{r} +m <- matrix(1:18, nrow=3, ncol=6) +print(m) +``` + +1. Which of the following commands will extract the values 11 and 14? + +A. `m[2,4,2,5]` + +B. `m[2:5]` + +C. `m[4:5,2]` + +D. 
`m[2,c(4,5)]` + +::::::::::::::: solution + +## Solution to challenge 4 + +D + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## List subsetting + +Now we'll introduce some new subsetting operators. There are three functions +used to subset lists. We've already seen these when learning about atomic vectors and matrices: `[`, `[[`, and `$`. + +Using `[` will always return a list. If you want to _subset_ a list, but not +_extract_ an element, then you will likely use `[`. + +```{r} +xlist <- list(a = "Software Carpentry", b = 1:10, data = head(mtcars)) +xlist[1] +``` + +This returns a _list with one element_. + +We can subset elements of a list exactly the same way as atomic +vectors using `[`. Comparison operations however won't work as +they're not recursive, they will try to condition on the data structures +in each element of the list, not the individual elements within those +data structures. + +```{r} +xlist[1:2] +``` + +To extract individual elements of a list, you need to use the double-square +bracket function: `[[`. + +```{r} +xlist[[1]] +``` + +Notice that now the result is a vector, not a list. + +You can't extract more than one element at once: + +```{r, error=TRUE} +xlist[[1:2]] +``` + +Nor use it to skip elements: + +```{r, error=TRUE} +xlist[[-1]] +``` + +But you can use names to both subset and extract elements: + +```{r} +xlist[["a"]] +``` + +The `$` function is a shorthand way for extracting elements by name: + +```{r} +xlist$data +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 5 + +Given the following list: + +```{r, eval=FALSE} +xlist <- list(a = "Software Carpentry", b = 1:10, data = head(mtcars)) +``` + +Using your knowledge of both list and vector subsetting, extract the number 2 from xlist. +Hint: the number 2 is contained within the "b" item in the list. 
+ +::::::::::::::: solution + +## Solution to challenge 5 + +```{r} +xlist$b[2] +``` + +```{r} +xlist[[2]][2] +``` + +```{r} +xlist[["b"]][2] +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 6 + +Given a linear model: + +```{r, eval=FALSE} +mod <- aov(pop ~ lifeExp, data=gapminder) +``` + +Extract the residual degrees of freedom (hint: `attributes()` will help you) + +::::::::::::::: solution + +## Solution to challenge 6 + +```{r, eval=FALSE} +attributes(mod) ## `df.residual` is one of the names of `mod` +``` + +```{r, eval=FALSE} +mod$df.residual +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Data frames + +Remember the data frames are lists underneath the hood, so similar rules +apply. However they are also two dimensional objects: + +`[` with one argument will act the same way as for lists, where each list +element corresponds to a column. The resulting object will be a data frame: + +```{r} +head(gapminder[3]) +``` + +Similarly, `[[` will act to extract _a single column_: + +```{r} +head(gapminder[["lifeExp"]]) +``` + +And `$` provides a convenient shorthand to extract columns by name: + +```{r} +head(gapminder$year) +``` + +With two arguments, `[` behaves the same way as for matrices: + +```{r} +gapminder[1:3,] +``` + +If we subset a single row, the result will be a data frame (because +the elements are mixed types): + +```{r} +gapminder[3,] +``` + +But for a single column the result will be a vector (this can +be changed with the third argument, `drop = FALSE`). + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 7 + +Fix each of the following common data frame subsetting errors: + +1. Extract observations collected for the year 1957 + +```{r, eval=FALSE} +gapminder[gapminder$year = 1957,] +``` + +2. 
Extract all columns except 1 through to 4 + +```{r, eval=FALSE} +gapminder[,-1:4] +``` + +3. Extract the rows where the life expectancy is longer than 80 years + +```{r, eval=FALSE} +gapminder[gapminder$lifeExp > 80] +``` + +4. Extract the first row, and the fourth and fifth columns + (`continent` and `lifeExp`). + +```{r, eval=FALSE} +gapminder[1, 4, 5] +``` + +5. Advanced: extract rows that contain information for the years 2002 + and 2007 + +```{r, eval=FALSE} +gapminder[gapminder$year == 2002 | 2007,] +``` + +::::::::::::::: solution + +## Solution to challenge 7 + +Fix each of the following common data frame subsetting errors: + +1. Extract observations collected for the year 1957 + +```{r, eval=FALSE} +# gapminder[gapminder$year = 1957,] +gapminder[gapminder$year == 1957,] +``` + +2. Extract all columns except 1 through to 4 + +```{r, eval=FALSE} +# gapminder[,-1:4] +gapminder[,-c(1:4)] +``` + +3. Extract the rows where the life expectancy is longer than 80 years + +```{r, eval=FALSE} +# gapminder[gapminder$lifeExp > 80] +gapminder[gapminder$lifeExp > 80,] +``` + +4. Extract the first row, and the fourth and fifth columns + (`continent` and `lifeExp`). + +```{r, eval=FALSE} +# gapminder[1, 4, 5] +gapminder[1, c(4, 5)] +``` + +5. Advanced: extract rows that contain information for the years 2002 + and 2007 + +```{r, eval=FALSE} +# gapminder[gapminder$year == 2002 | 2007,] +gapminder[gapminder$year == 2002 | gapminder$year == 2007,] +gapminder[gapminder$year %in% c(2002, 2007),] +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 8 + +1. Why does `gapminder[1:20]` return an error? How does it differ from `gapminder[1:20, ]`? + +2. Create a new `data.frame` called `gapminder_small` that only contains rows 1 through 9 + and 19 through 23. You can do this in one or two steps. + +::::::::::::::: solution + +## Solution to challenge 8 + +1. 
`gapminder` is a data.frame so needs to be subsetted on two dimensions. `gapminder[1:20, ]` subsets the data to give the first 20 rows and all columns. + +2. + +```{r} +gapminder_small <- gapminder[c(1:9, 19:23),] +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: keypoints + +- Indexing in R starts at 1, not 0. +- Access individual values by location using `[]`. +- Access slices of data using `[low:high]`. +- Access arbitrary sets of data using `[c(...)]`. +- Use logical operations and logical vectors to access subsets of data. + +:::::::::::::::::::::::::::::::::::::::::::::::::: diff --git a/locale/it/episodes/07-control-flow.Rmd b/locale/it/episodes/07-control-flow.Rmd new file mode 100644 index 000000000..39946a2c4 --- /dev/null +++ b/locale/it/episodes/07-control-flow.Rmd @@ -0,0 +1,565 @@ +--- +title: Control Flow +teaching: 45 +exercises: 20 +source: Rmd +--- + +::::::::::::::::::::::::::::::::::::::: objectives + +- Write conditional statements with `if...else` statements and `ifelse()`. +- Write and understand `for()` loops. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- How can I make data-dependent choices in R? +- How can I repeat operations in R? + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +```{r, include=FALSE} +gapminder <- read.csv("data/gapminder_data.csv", header = TRUE) +set.seed(10) +``` + +Often when we're coding we want to control the flow of our actions. This can be done +by setting actions to occur only if a condition or a set of conditions are met. +Alternatively, we can also set an action to occur a particular number of times. + +There are several ways you can control flow in R. +For conditional statements, the most commonly used approaches are the constructs: + +```{r, eval=FALSE} +# if +if (condition is true) { + perform action +} + +# if ... 
else +if (condition is true) { + perform action +} else { # that is, if the condition is false, + perform alternative action +} +``` + +Say, for example, that we want R to print a message if a variable `x` has a particular value: + +```{r} +x <- 8 + +if (x >= 10) { + print("x is greater than or equal to 10") +} + +x +``` + +The print statement does not appear in the console because `x` is not greater than or equal to 10. To print a different message for numbers less than 10, we can add an `else` statement. + +```{r} +x <- 8 + +if (x >= 10) { + print("x is greater than or equal to 10") +} else { + print("x is less than 10") +} +``` + +You can also test multiple conditions by using `else if`. + +```{r} +x <- 8 + +if (x >= 10) { + print("x is greater than or equal to 10") +} else if (x > 5) { + print("x is greater than 5, but less than 10") +} else { + print("x is less than or equal to 5") +} +``` + +**Important:** when R evaluates the condition inside `if()` statements, it is +looking for a logical element, i.e., `TRUE` or `FALSE`. This can cause some +headaches for beginners. For example: + +```{r} +x <- 4 == 3 +if (x) { + "4 equals 3" +} else { + "4 does not equal 3" +} +``` + +As we can see, the "does not equal" message was printed because the vector `x` is `FALSE`: + +```{r} +x <- 4 == 3 +x +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 1 + +Use an `if()` statement to print a suitable message +reporting whether there are any records from 2002 in +the `gapminder` dataset. +Now do the same for 2012. + +::::::::::::::: solution + +## Solution to Challenge 1 + +We will first see a solution to Challenge 1 which does not use the `any()` function. 
+We first obtain a logical vector describing which element of `gapminder$year` is equal to `2002`: + +```{r ch10pt1-sol, eval=FALSE} +gapminder[(gapminder$year == 2002),] +``` + +Then, we count the number of rows of the data.frame `gapminder` that correspond to the 2002: + +```{r ch10pt2-sol, eval=FALSE} +rows2002_number <- nrow(gapminder[(gapminder$year == 2002),]) +``` + +The presence of any record for the year 2002 is equivalent to the request that `rows2002_number` is one or more: + +```{r ch10pt3-sol, eval=FALSE} +rows2002_number >= 1 +``` + +Putting all together, we obtain: + +```{r ch10pt4-sol, eval=FALSE} +if(nrow(gapminder[(gapminder$year == 2002),]) >= 1){ + print("Record(s) for the year 2002 found.") +} +``` + +All this can be done more quickly with `any()`. The logical condition can be expressed as: + +```{r ch10pt5-sol, eval=FALSE} +if(any(gapminder$year == 2002)){ + print("Record(s) for the year 2002 found.") +} +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +Did anyone get a warning message like this? + +```{r, echo=FALSE} +if (gapminder$year == 2012) {} +``` + +The `if()` function only accepts singular (of length 1) inputs, and therefore +returns an error when you use it with a vector. The `if()` function will still +run, but will only evaluate the condition in the first element of the vector. +Therefore, to use the `if()` function, you need to make sure your input is +singular (of length 1). + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: Built in `ifelse()` function + +`R` accepts both `if()` and `else if()` statements structured as outlined above, +but also statements using `R`'s built-in `ifelse()` function. 
This +function accepts both singular and vector inputs and is structured as +follows: + +```{r, eval=FALSE} +# ifelse function +ifelse(condition is true, perform action, perform alternative action) + +``` + +where the first argument is the condition or a set of conditions to be met, the +second argument is the statement that is evaluated when the condition is `TRUE`, +and the third statement is the statement that is evaluated when the condition +is `FALSE`. + +```{r} +y <- -3 +ifelse(y < 0, "y is a negative number", "y is either positive or zero") + +``` + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: `any()` and `all()` + +The `any()` function will return `TRUE` if at least one +`TRUE` value is found within a vector, otherwise it will return `FALSE`. +This can be used in a similar way to the `%in%` operator. +The function `all()`, as the name suggests, will only return `TRUE` if all values in +the vector are `TRUE`. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Repeating operations + +If you want to iterate over +a set of values, when the order of iteration is important, and perform the +same operation on each, a `for()` loop will do the job. +We saw `for()` loops in the [shell lessons earlier](https://swcarpentry.github.io/shell-novice/05-loop.html). This is the most +flexible of looping operations, but therefore also the hardest to use +correctly. In general, the advice of many `R` users would be to learn about +`for()` loops, but to avoid using `for()` loops unless the order of iteration is +important: i.e. the calculation at each iteration depends on the results of +previous iterations. If the order of iteration is not important, then you +should learn about vectorized alternatives, such as the `purrr` package, as they +pay off in computational efficiency. 
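To make the distinction concrete, here is a small sketch (the vector `values` and the other variable names are invented for this illustration). Squaring each element does not depend on the order of iteration, so a vectorized expression is enough. A running total depends on the result of the previous iteration, so a `for()` loop is a natural fit, although base R also happens to provide `cumsum()` for this particular case.

```r
values <- c(2, 4, 6, 8)   # hypothetical example data

# Order does not matter: operate on the whole vector at once
squares <- values^2

# Order matters: each step uses the result of the previous one
running_total <- numeric(length(values))   # result object defined up front
for (i in seq_along(values)) {
  if (i == 1) {
    running_total[i] <- values[i]
  } else {
    running_total[i] <- running_total[i - 1] + values[i]
  }
}

squares
running_total
```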
+ +The basic structure of a `for()` loop is: + +```{r, eval=FALSE} +for (iterator in set of values) { + do a thing +} +``` + +For example: + +```{r} +for (i in 1:10) { + print(i) +} +``` + +The `1:10` bit creates a vector on the fly; you can iterate +over any other vector as well. + +We can use a `for()` loop nested within another `for()` loop to iterate over two things at +once. + +```{r} +for (i in 1:5) { + for (j in c('a', 'b', 'c', 'd', 'e')) { + print(paste(i,j)) + } +} +``` + +We notice in the output that when the first index (`i`) is set to 1, the second +index (`j`) iterates through its full set of indices. Once the indices of `j` +have been iterated through, then `i` is incremented. This process continues +until the last index has been used for each `for()` loop. + +Rather than printing the results, we could write the loop output to a new object. + +```{r} +output_vector <- c() +for (i in 1:5) { + for (j in c('a', 'b', 'c', 'd', 'e')) { + temp_output <- paste(i, j) + output_vector <- c(output_vector, temp_output) + } +} +output_vector +``` + +This approach can be useful, but 'growing your results' (building +the result object incrementally) is computationally inefficient, so avoid +it when you are iterating through a lot of values. + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: don't grow your results + +One of the biggest things that trips up novices and +experienced R users alike is building a results object +(vector, list, matrix, data frame) as your for loop progresses. +Computers are very bad at handling this, so your calculations +can very quickly slow to a crawl. It's much better to define +an empty results object of appropriate dimensions beforehand, rather +than initializing an empty object without dimensions. +So if you know the results of the example above will be stored in a matrix, +create an empty matrix with 5 rows and 5 columns, then at each iteration +store the results in the appropriate location. 
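The same idea applies to a plain vector. The sketch below is only an illustration (the names `n` and `results` are made up): the result vector is created at its final length before the loop, and each iteration fills in one position instead of appending.

```r
n <- 5

# Define a character vector of the final length before the loop starts
results <- character(n)

for (i in seq_len(n)) {
  # Fill in position i rather than growing the vector with c()
  results[i] <- paste("iteration", i)
}

results
```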
+ +:::::::::::::::::::::::::::::::::::::::::::::::::: + +A better way is to define your (empty) output object before filling in the values. +For this example, it looks more involved, but is still more efficient. + +```{r} +output_matrix <- matrix(nrow = 5, ncol = 5) +j_vector <- c('a', 'b', 'c', 'd', 'e') +for (i in 1:5) { + for (j in 1:5) { + temp_j_value <- j_vector[j] + temp_output <- paste(i, temp_j_value) + output_matrix[i, j] <- temp_output + } +} +output_vector2 <- as.vector(output_matrix) +output_vector2 +``` + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: While loops + +Sometimes you will find yourself needing to repeat an operation as long as a certain +condition is met. You can do this with a `while()` loop. + +```{r, eval=FALSE} +while(this condition is true){ + do a thing +} +``` + +R will interpret a condition being met as "TRUE". + +As an example, here's a while loop +that generates random numbers from a uniform distribution (the `runif()` function) +between 0 and 1 until it gets one that's less than 0.1. + +```r +z <- 1 +while(z > 0.1){ + z <- runif(1) + cat(z, "\n") +} +``` + +`while()` loops will not always be appropriate. You have to be particularly careful +that you don't end up stuck in an infinite loop because your condition is always met and hence the while statement never terminates. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 2 + +Compare the objects `output_vector` and +`output_vector2`. Are they the same? If not, why not? +How would you change the last block of code to make `output_vector2` +the same as `output_vector`? 
+ +::::::::::::::: solution + +## Solution to Challenge 2 + +We can check whether the two vectors are identical using the `all()` function, which here returns `FALSE`: + +```{r ch10pt6-sol, eval=FALSE} +all(output_vector == output_vector2) +``` + +However, all the elements of `output_vector` can be found in `output_vector2`: + +```{r ch10pt7-sol, eval=FALSE} +all(output_vector %in% output_vector2) +``` + +and vice versa: + +```{r ch10pt8-sol, eval=FALSE} +all(output_vector2 %in% output_vector) +``` + +Therefore, the elements of `output_vector` and `output_vector2` are the same but sorted in a different order. +This is because `as.vector()` reads the elements of an input matrix column by column. +Taking a look at `output_matrix`, we can notice that we want its elements by rows. +The solution is to transpose the `output_matrix`. We can do it either by calling the transpose function +`t()` or by inputting the elements in the right order. +The first solution requires changing the original + +```{r ch10pt9-sol, eval=FALSE} +output_vector2 <- as.vector(output_matrix) +``` + +into + +```{r ch10pt10-sol, eval=FALSE} +output_vector2 <- as.vector(t(output_matrix)) +``` + +The second solution requires changing + +```{r ch10pt11-sol, eval=FALSE} +output_matrix[i, j] <- temp_output +``` + +into + +```{r ch10pt12-sol, eval=FALSE} +output_matrix[j, i] <- temp_output +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 3 + +Write a script that loops through the `gapminder` data by continent and prints out +whether the mean life expectancy is smaller or larger than 50 +years.
+ +::::::::::::::: solution + +## Solution to Challenge 3 + +**Step 1**: We want to make sure we can extract all the unique values of the continent vector + +```{r 07-chall-03-sol-a, eval=FALSE} +gapminder <- read.csv("data/gapminder_data.csv") +unique(gapminder$continent) +``` + +**Step 2**: We also need to loop over each of these continents and calculate the average life expectancy for each `subset` of data. +We can do that as follows: + +1. Loop over each of the unique values of 'continent' +2. For each value of continent, create a temporary variable storing that subset +3. Return the calculated life expectancy to the user by printing the output: + +```{r 07-chall-03-sol-b, eval=FALSE} +for (iContinent in unique(gapminder$continent)) { + tmp <- gapminder[gapminder$continent == iContinent, ] + cat(iContinent, mean(tmp$lifeExp, na.rm = TRUE), "\n") + rm(tmp) +} +``` + +**Step 3**: The exercise only wants the output printed if the average life expectancy is less than 50 or greater than 50. +So we need to add an `if()` condition before printing, which evaluates whether the calculated average life expectancy is above or below a threshold, and prints an output conditional on the result. +We need to amend (3) from above: + +3a. 
If the calculated life expectancy is less than some threshold (50 years), return the continent and a statement that life expectancy is less than threshold, otherwise return the continent and a statement that life expectancy is greater than threshold: + +```{r 07-chall-03-sol-c, eval=FALSE} +thresholdValue <- 50 + +for (iContinent in unique(gapminder$continent)) { + tmp <- mean(gapminder[gapminder$continent == iContinent, "lifeExp"]) + + if (tmp < thresholdValue){ + cat("Average Life Expectancy in", iContinent, "is less than", thresholdValue, "\n") + } else { + cat("Average Life Expectancy in", iContinent, "is greater than", thresholdValue, "\n") + } # end if else condition + rm(tmp) +} # end for loop + +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 4 + +Modify the script from Challenge 3 to loop over each +country. This time print out whether the life expectancy is +smaller than 50, between 50 and 70, or greater than 70. 
+ +::::::::::::::: solution + +## Solution to Challenge 4 + +We modify our solution to Challenge 3 by adding two thresholds, `lowerThreshold` and `upperThreshold`, and extending our if-else statements: + +```{r 07-chall-04-sol, eval=FALSE} +lowerThreshold <- 50 +upperThreshold <- 70 + +for (iCountry in unique(gapminder$country)) { + tmp <- mean(gapminder[gapminder$country == iCountry, "lifeExp"]) + + if(tmp < lowerThreshold) { + cat("Average Life Expectancy in", iCountry, "is less than", lowerThreshold, "\n") + } else if(tmp >= lowerThreshold && tmp < upperThreshold) { + cat("Average Life Expectancy in", iCountry, "is between", lowerThreshold, "and", upperThreshold, "\n") + } else { + cat("Average Life Expectancy in", iCountry, "is greater than", upperThreshold, "\n") + } + rm(tmp) +} +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 5 - Advanced + +Write a script that loops over each country in the `gapminder` dataset, +tests whether the country starts with a 'B', and graphs life expectancy +against time as a line graph if the mean life expectancy is under 50 years. + +::::::::::::::: solution + +## Solution for Challenge 5 + +We will use the `grep()` command that was introduced in the [Unix Shell lesson](https://swcarpentry.github.io/shell-novice/07-find.html) +to find countries that start with "B". +Let's understand how to do this first. +Following from the Unix shell section we may be tempted to try the following: + +```{r 07-chall-05-sol-a, eval=FALSE} +grep("^B", unique(gapminder$country)) +``` + +But when we evaluate this command it returns the indices of the countries that start with "B", not their names.
+To get the values, we must add the `value=TRUE` option to the `grep()` command: + +```{r 07-chall-05-sol-b, eval=FALSE} +grep("^B", unique(gapminder$country), value = TRUE) +``` + +We will now store these countries in a variable called candidateCountries, and then loop over each entry in the variable. +Inside the loop, we evaluate the average life expectancy for each country, and if the average life expectancy is less than 50 we use base-plot to plot the evolution of average life expectancy using `with()` and `subset()`: + +```{r 07-chall-05-sol-c, eval=FALSE} +thresholdValue <- 50 +candidateCountries <- grep("^B", unique(gapminder$country), value = TRUE) + +for (iCountry in candidateCountries) { + tmp <- mean(gapminder[gapminder$country == iCountry, "lifeExp"]) + + if (tmp < thresholdValue) { + cat("Average Life Expectancy in", iCountry, "is less than", thresholdValue, "plotting life expectancy graph... \n") + + with(subset(gapminder, country == iCountry), + plot(year, lifeExp, + type = "o", + main = paste("Life Expectancy in", iCountry, "over time"), + ylab = "Life Expectancy", + xlab = "Year" + ) # end plot + ) # end with + } # end if + rm(tmp) +} # end for loop +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: keypoints + +- Use `if` and `else` to make choices. +- Use `for` to repeat operations. + +:::::::::::::::::::::::::::::::::::::::::::::::::: diff --git a/locale/it/episodes/08-plot-ggplot2.Rmd b/locale/it/episodes/08-plot-ggplot2.Rmd new file mode 100644 index 000000000..12998dc14 --- /dev/null +++ b/locale/it/episodes/08-plot-ggplot2.Rmd @@ -0,0 +1,471 @@ +--- +title: Creating Publication-Quality Graphics with ggplot2 +teaching: 60 +exercises: 20 +source: Rmd +--- + +::::::::::::::::::::::::::::::::::::::: objectives + +- To be able to use ggplot2 to generate publication-quality graphics. +- To apply geometry, aesthetic, and statistics layers to a ggplot plot. 
+- To manipulate the aesthetics of a plot using different colors, shapes, and lines. +- To improve data visualization through transforming scales and paneling by group. +- To save a plot created with ggplot to disk. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- How can I create publication-quality graphics in R? + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +```{r, include=FALSE} +gapminder <- read.csv("data/gapminder_data.csv", header = TRUE) +``` + +Plotting our data is one of the best ways to +quickly explore it and the various relationships +between variables. + +There are three main plotting systems in R, +the [base plotting system][base], the [lattice] +package, and the [ggplot2] package. + +Today we'll be learning about the ggplot2 package, because +it is the most effective for creating publication-quality +graphics. + +ggplot2 is built on the grammar of graphics, the idea that any plot can be +built from the same set of components: a **data set**, +**mapping aesthetics**, and graphical **layers**: + +- **Data sets** are the data that you, the user, provide. + +- **Mapping aesthetics** are what connect the data to the graphics. + They tell ggplot2 how to use your data to affect how the graph looks, + such as changing what is plotted on the X or Y axis, or the size or + color of different data points. + +- **Layers** are the actual graphical output from ggplot2. Layers + determine what kinds of plot are shown (scatterplot, histogram, etc.), + the coordinate system used (rectangular, polar, others), and other + important aspects of the plot. The idea of layers of graphics may + be familiar to you if you have used image editing programs + like Photoshop, Illustrator, or Inkscape. + +Let's start off building an example using the gapminder data from earlier. +The most basic function is `ggplot`, which lets R know that we're +creating a new plot. 
Any of the arguments we give the `ggplot` +function are the _global_ options for the plot: they apply to all +layers on the plot. + +```{r blank-ggplot, message=FALSE, fig.alt="Blank plot, before adding any mapping aesthetics to ggplot()."} +library("ggplot2") +ggplot(data = gapminder) +``` + +Here we called `ggplot` and told it what data we want to show on +our figure. This is not enough information for `ggplot` to actually +draw anything. It only creates a blank slate for other elements +to be added to. + +Now we're going to add in the **mapping aesthetics** using the +`aes` function. `aes` tells `ggplot` how variables in the **data** +map to _aesthetic_ properties of the figure, such as which columns +of the data should be used for the **x** and **y** locations. + +```{r ggplot-with-aes, message=FALSE, fig.alt="Plotting area with axes for a scatter plot of life expectancy vs GDP, with no data points visible."} +ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +``` + +Here we told `ggplot` we want to plot the "gdpPercap" column of the +gapminder data frame on the x-axis, and the "lifeExp" column on the +y-axis. Notice that we didn't need to explicitly pass `aes` these +columns (e.g. `x = gapminder[, "gdpPercap"]`), this is because +`ggplot` is smart enough to know to look in the **data** for that column! + +The final part of making our plot is to tell `ggplot` how we want to +visually represent the data. We do this by adding a new **layer** +to the plot using one of the **geom** functions. + +```{r lifeExp-vs-gdpPercap-scatter, message=FALSE, fig.alt="Scatter plot of life expectancy vs GDP per capita, now showing the data points."} +ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) + + geom_point() +``` + +Here we used `geom_point`, which tells `ggplot` we want to visually +represent the relationship between **x** and **y** as a scatterplot of points. 
+ +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 1 + +Modify the example so that the figure shows how life expectancy has +changed over time: + +```{r, eval=FALSE} +ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) + geom_point() +``` + +Hint: the gapminder dataset has a column called "year", which should appear +on the x-axis. + +::::::::::::::: solution + +## Solution to challenge 1 + +Here is one possible solution: + +```{r ch1-sol, fig.cap="Binned scatterplot of life expectancy versus year showing how life expectancy has increased over time"} +ggplot(data = gapminder, mapping = aes(x = year, y = lifeExp)) + geom_point() +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 2 + +In the previous examples and challenge we've used the `aes` function to tell +the scatterplot **geom** about the **x** and **y** locations of each point. +Another _aesthetic_ property we can modify is the point _color_. Modify the +code from the previous challenge to **color** the points by the "continent" +column. What trends do you see in the data? Are they what you expected? + +::::::::::::::: solution + +## Solution to challenge 2 + +The solution presented below adds `color=continent` to the call of the `aes` +function. The general trend seems to indicate an increased life expectancy +over the years. On continents with stronger economies we find a longer life +expectancy. + +```{r ch2-sol, fig.cap="Binned scatterplot of life expectancy vs year with color-coded continents showing value of 'aes' function"} +ggplot(data = gapminder, mapping = aes(x = year, y = lifeExp, color=continent)) + + geom_point() +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Layers + +Using a scatterplot probably isn't the best for visualizing change over time. 
+Instead, let's tell `ggplot` to visualize the data as a line plot: + +```{r lifeExp-line} +ggplot(data = gapminder, mapping = aes(x=year, y=lifeExp, color=continent)) + + geom_line() +``` + +Instead of adding a `geom_point` layer, we've added a `geom_line` layer. + +However, the result doesn't look quite as we might have expected: it seems to be jumping around a lot in each continent. Let's try to separate the data by country, plotting one line for each country: + +```{r lifeExp-line-by} +ggplot(data = gapminder, mapping = aes(x=year, y=lifeExp, group=country, color=continent)) + + geom_line() +``` + +We've added the **group** _aesthetic_, which tells `ggplot` to draw a line for each +country. + +But what if we want to visualize both lines and points on the plot? We can +add another layer to the plot: + +```{r lifeExp-line-point} +ggplot(data = gapminder, mapping = aes(x=year, y=lifeExp, group=country, color=continent)) + + geom_line() + geom_point() +``` + +It's important to note that each layer is drawn on top of the previous layer. In +this example, the points have been drawn _on top of_ the lines. Here's a +demonstration: + +```{r lifeExp-layer-example-1} +ggplot(data = gapminder, mapping = aes(x=year, y=lifeExp, group=country)) + + geom_line(mapping = aes(color=continent)) + geom_point() +``` + +In this example, the _aesthetic_ mapping of **color** has been moved from the +global plot options in `ggplot` to the `geom_line` layer so it no longer applies +to the points. Now we can clearly see that the points are drawn on top of the +lines. + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: Setting an aesthetic to a value instead of a mapping + +So far, we've seen how to use an aesthetic (such as **color**) as a _mapping_ to a variable in the data. For example, when we use `geom_line(mapping = aes(color=continent))`, ggplot will give a different color to each continent. But what if we want to change the color of all lines to blue? 
You may think that `geom_line(mapping = aes(color="blue"))` should work, but it doesn't: inside `aes()`, ggplot treats "blue" as a data value to map, not as a color name. Since we don't want to create a mapping to a specific variable, we can move the color specification outside of the `aes()` function, like this: `geom_line(color="blue")`. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 3 + +Switch the order of the point and line layers from the previous example. What +happened? + +::::::::::::::: solution + +## Solution to challenge 3 + +The lines now get drawn over the points! + +```{r ch3-sol, fig.alt="Plot of life expectancy vs year with black points drawn first and a line for each country, colored by continent, drawn on top of the points."} +ggplot(data = gapminder, mapping = aes(x=year, y=lifeExp, group=country)) + + geom_point() + geom_line(mapping = aes(color=continent)) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Transformations and statistics + +ggplot2 also makes it easy to overlay statistical models over the data. To +demonstrate we'll go back to our first example: + +```{r lifeExp-vs-gdpPercap-scatter3, message=FALSE} +ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) + + geom_point() +``` + +Currently it's hard to see the relationship between the points due to some strong +outliers in GDP per capita. We can change the scale of units on the x axis using +the _scale_ functions. These control the mapping between the data values and +visual values of an aesthetic. We can also modify the transparency of the +points, using the _alpha_ argument, which is especially helpful when you have +a large amount of data which is very clustered.
+ +```{r axis-scale, fig.cap="Scatterplot of GDP vs life expectancy showing logarithmic x-axis data spread"} +ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) + + geom_point(alpha = 0.5) + scale_x_log10() +``` + +The `scale_x_log10` function applied a transformation to the coordinate system of the plot, so that each multiple of 10 is evenly spaced from left to right. For example, a GDP per capita of 1,000 is the same horizontal distance away from a value of 10,000 as the 10,000 value is from 100,000. This helps to visualize the spread of the data along the x-axis. + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip Reminder: Setting an aesthetic to a value instead of a mapping + +Notice that we used `geom_point(alpha = 0.5)`. As the previous tip mentioned, using a setting outside of the `aes()` function will cause this value to be used for all points, which is what we want in this case. But just like any other aesthetic setting, _alpha_ can also be mapped to a variable in the data. For example, we can give a different transparency to each continent with `geom_point(mapping = aes(alpha = continent))`. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +We can fit a simple relationship to the data by adding another layer, +`geom_smooth`: + +```{r lm-fit, fig.alt="Scatter plot of life expectancy vs GDP per capita with a blue trend line summarising the relationship between variables, and gray shaded area indicating 95% confidence intervals for that trend line."} +ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) + + geom_point(alpha = 0.5) + scale_x_log10() + geom_smooth(method="lm") +``` + +We can make the line thicker by _setting_ the **linewidth** aesthetic in the +`geom_smooth` layer: + +```{r lm-fit2, fig.alt="Scatter plot of life expectancy vs GDP per capita with a trend line summarising the relationship between variables. 
The blue trend line is slightly thicker than in the previous figure."} +ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) + + geom_point(alpha = 0.5) + scale_x_log10() + geom_smooth(method="lm", linewidth=1.5) +``` + +There are two ways an _aesthetic_ can be specified. Here we _set_ the **linewidth** aesthetic by passing it as an argument to `geom_smooth`, so it is applied uniformly to the whole `geom`. Previously in the lesson we've used the `aes` function to define a _mapping_ between data variables and their visual representation. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 4a + +Modify the color and size of the points on the point layer in the previous +example. + +Hint: do not use the `aes` function. + +Hint: the equivalent of `linewidth` for points is `size`. + +::::::::::::::: solution + +## Solution to challenge 4a + +Here is a possible solution. +Notice that the `color` argument is supplied outside of the `aes()` function. +This means that it applies to all data points on the graph and is not related to +a specific variable. + +```{r ch4a-sol, fig.alt="Scatter plot of life expectancy vs GDP per capita with a trend line summarising the relationship between variables. The plot illustrates the possibilities for styling visualisations in ggplot2 with data points enlarged, coloured orange, and displayed without transparency."} +ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) + + geom_point(size=3, color="orange") + scale_x_log10() + + geom_smooth(method="lm", linewidth=1.5) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 4b + +Modify your solution to Challenge 4a so that the +points are now a different shape and are colored by continent with new +trendlines. Hint: The color argument can be used inside the aesthetic.
+ +::::::::::::::: solution + +## Solution to challenge 4b + +Here is a possible solution. +Notice that supplying the `color` argument inside the `aes()` function enables you to +connect it to a certain variable. The `shape` argument, as you can see, modifies all +data points the same way (it is outside the `aes()` call), while the `color` argument, which +is placed inside the `aes()` call, modifies a point's color based on its continent value. + +```{r ch4b-sol} +ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp, color = continent)) + + geom_point(size=3, shape=17) + scale_x_log10() + + geom_smooth(method="lm", linewidth=1.5) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Multi-panel figures + +Earlier we visualized the change in life expectancy over time across all +countries in one plot. Alternatively, we can split this out over multiple panels +by adding a layer of **facet** panels. + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip + +We start by making a subset of data including only countries located +in the Americas. This includes 25 countries, which will begin to +clutter the figure. Note that we apply a "theme" definition to rotate +the x-axis labels to maintain readability. Nearly everything in +ggplot2 is customizable. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +```{r facet} +americas <- gapminder[gapminder$continent == "Americas",] +ggplot(data = americas, mapping = aes(x = year, y = lifeExp)) + + geom_line() + + facet_wrap( ~ country) + + theme(axis.text.x = element_text(angle = 45)) +``` + +The `facet_wrap` layer took a "formula" as its argument, denoted by the tilde +(~). This tells R to draw a panel for each unique value in the `country` column +of the `americas` data frame. + +## Modifying text + +To clean this figure up for a publication we need to change some of the text +elements.
The x-axis is too cluttered, and the y axis should read +"Life expectancy", rather than the column name in the data frame. + +We can do this by adding a couple of different layers. The **theme** layer +controls the axis text, and overall text size. Labels for the axes, plot +title and any legend can be set using the `labs` function. Legend titles +are set using the same names we used in the `aes` specification. Thus below +the color legend title is set using `color = "Continent"`, while the title +of a fill legend would be set using `fill = "MyTitle"`. + +```{r theme} +ggplot(data = americas, mapping = aes(x = year, y = lifeExp, color=continent)) + + geom_line() + facet_wrap( ~ country) + + labs( + x = "Year", # x axis title + y = "Life expectancy", # y axis title + title = "Figure 1", # main title of figure + color = "Continent" # title of legend + ) + + theme(axis.text.x = element_text(angle = 90, hjust = 1)) +``` + +## Exporting the plot + +The `ggsave()` function allows you to export a plot created with ggplot. You can specify the dimension and resolution of your plot by adjusting the appropriate arguments (`width`, `height` and `dpi`) to create high quality graphics for publication. In order to save the plot from above, we first assign it to a variable `lifeExp_plot`, then tell `ggsave` to save that plot in `png` format to a directory called `results`. (Make sure you have a `results/` folder in your working directory.) 
+ +```{r directory-check, echo=FALSE} +if (!dir.exists("results")) { + dir.create("results") +} +``` + +```{r save} +lifeExp_plot <- ggplot(data = americas, mapping = aes(x = year, y = lifeExp, color=continent)) + + geom_line() + facet_wrap( ~ country) + + labs( + x = "Year", # x axis title + y = "Life expectancy", # y axis title + title = "Figure 1", # main title of figure + color = "Continent" # title of legend + ) + + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + +ggsave(filename = "results/lifeExp.png", plot = lifeExp_plot, width = 12, height = 10, dpi = 300, units = "cm") +``` + +There are two nice things about `ggsave`. First, it defaults to the last plot, so if you omit the `plot` argument it will automatically save the last plot you created with `ggplot`. Secondly, it tries to determine the format you want to save your plot in from the file extension you provide for the filename (for example `.png` or `.pdf`). If you need to, you can specify the format explicitly in the `device` argument. + +This is a taste of what you can do with ggplot2. RStudio provides a +really useful [cheat sheet][cheat] of the different layers available, and more +extensive documentation is available on the [ggplot2 website][ggplot-doc]. All RStudio cheat sheets are available from the [RStudio website][cheat_all]. +Finally, if you have no idea how to change something, a quick Google search will +usually send you to a relevant question and answer on Stack Overflow with reusable +code to modify! + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 5 + +Generate boxplots to compare life expectancy between the different continents during the available years. + +Advanced: + +- Rename y axis as Life Expectancy. +- Remove x axis labels. 
+ +::::::::::::::: solution + +## Solution to Challenge 5 + +Here is a possible solution. +`xlab()` and `ylab()` set labels for the x and y axes, respectively. +The axis title, text and ticks are attributes of the theme and must be modified within a `theme()` call. + +```{r ch5-sol} +ggplot(data = gapminder, mapping = aes(x = continent, y = lifeExp, fill = continent)) + + geom_boxplot() + facet_wrap(~year) + + ylab("Life Expectancy") + + theme(axis.title.x=element_blank(), + axis.text.x = element_blank(), + axis.ticks.x = element_blank()) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +[base]: https://www.statmethods.net/graphs/index.html +[lattice]: https://www.statmethods.net/advgraphs/trellis.html +[ggplot2]: https://www.statmethods.net/advgraphs/ggplot2.html +[cheat]: https://www.rstudio.org/links/data_visualization_cheat_sheet +[cheat_all]: https://www.rstudio.com/resources/cheatsheets/ +[ggplot-doc]: https://ggplot2.tidyverse.org/reference/ + +:::::::::::::::::::::::::::::::::::::::: keypoints + +- Use `ggplot2` to create plots. +- Think about graphics in layers: aesthetics, geometry, statistics, scale transformation, and grouping. + +:::::::::::::::::::::::::::::::::::::::::::::::::: diff --git a/locale/it/episodes/09-vectorization.Rmd b/locale/it/episodes/09-vectorization.Rmd new file mode 100644 index 000000000..9cae732ed --- /dev/null +++ b/locale/it/episodes/09-vectorization.Rmd @@ -0,0 +1,332 @@ +--- +title: Vectorization +teaching: 10 +exercises: 15 +source: Rmd +--- + +::::::::::::::::::::::::::::::::::::::: objectives + +- To understand vectorized operations in R. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- How can I operate on all the elements of a vector at once?
+ +:::::::::::::::::::::::::::::::::::::::::::::::::: + +```{r, include=FALSE} +gapminder <- read.csv("data/gapminder_data.csv", header = TRUE) +library("ggplot2") +``` + +Most of R's functions are vectorized, meaning that the function will +operate on all elements of a vector without needing to loop through +and act on each element one at a time. This makes writing code more +concise, easy to read, and less error prone. + +```{r} +x <- 1:4 +x * 2 +``` + +The multiplication happened to each element of the vector. + +We can also add two vectors together: + +```{r} +y <- 6:9 +x + y +``` + +Each element of `x` was added to its corresponding element of `y`: + +```{r, eval=FALSE} +x: 1 2 3 4 + + + + + +y: 6 7 8 9 +--------------- + 7 9 11 13 +``` + +Here is how we would add two vectors together using a for loop: + +```{r} +output_vector <- c() +for (i in 1:4) { + output_vector[i] <- x[i] + y[i] +} +output_vector + + +``` + +Compare this to the output using vectorised operations. + +```{r} +sum_xy <- x + y +sum_xy +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 1 + +Let's try this on the `pop` column of the `gapminder` dataset. + +Make a new column in the `gapminder` data frame that +contains population in units of millions of people. +Check the head or tail of the data frame to make sure +it worked. + +::::::::::::::: solution + +## Solution to challenge 1 + +Let's try this on the `pop` column of the `gapminder` dataset. + +Make a new column in the `gapminder` data frame that +contains population in units of millions of people. +Check the head or tail of the data frame to make sure +it worked. + +```{r} +gapminder$pop_millions <- gapminder$pop / 1e6 +head(gapminder) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 2 + +On a single graph, plot population, in +millions, against year, for all countries. 
Do not worry about +identifying which country is which. + +Repeat the exercise, graphing only for China, India, and +Indonesia. Again, do not worry about which is which. + +::::::::::::::: solution + +## Solution to challenge 2 + +Refresh your plotting skills by plotting population in millions against year. + +```{r ch2-sol, fig.alt="Scatter plot showing populations in the millions against the year for China, India, and Indonesia, countries are not labeled."} +ggplot(gapminder, aes(x = year, y = pop_millions)) + + geom_point() +countryset <- c("China","India","Indonesia") +ggplot(gapminder[gapminder$country %in% countryset,], + aes(x = year, y = pop_millions)) + + geom_point() +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +Comparison operators, logical operators, and many functions are also +vectorized: + +**Comparison operators** + +```{r} +x > 2 +``` + +**Logical operators** + +```{r} +a <- x > 3 # or, for clarity, a <- (x > 3) +a +``` + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: some useful functions for logical vectors + +`any()` will return `TRUE` if _any_ element of a vector is `TRUE`.\ +`all()` will return `TRUE` if _all_ elements of a vector are `TRUE`. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +Most functions also operate element-wise on vectors: + +**Functions** + +```{r} +x <- 1:4 +log(x) +``` + +Vectorized operations work element-wise on matrices: + +```{r} +m <- matrix(1:12, nrow=3, ncol=4) +m * -1 +``` + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: element-wise vs. matrix multiplication + +Very important: the operator `*` gives you element-wise multiplication! 
+To do matrix multiplication, we need to use the `%*%` operator: + +```{r} +m %*% matrix(1, nrow=4, ncol=1) +matrix(1:4, nrow=1) %*% matrix(1:4, ncol=1) +``` + +For more on matrix algebra, see the Quick-R reference +guide + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 3 + +Given the following matrix: + +```{r} +m <- matrix(1:12, nrow=3, ncol=4) +m +``` + +Write down what you think will happen when you run: + +1. `m ^ -1` +2. `m * c(1, 0, -1)` +3. `m > c(0, 20)` +4. `m * c(1, 0, -1, 2)` + +Did you get the output you expected? If not, ask a helper! + +::::::::::::::: solution + +## Solution to challenge 3 + +Given the following matrix: + +```{r} +m <- matrix(1:12, nrow=3, ncol=4) +m +``` + +Write down what you think will happen when you run: + +1. `m ^ -1` + +```{r, echo=FALSE} +m ^ -1 +``` + +2. `m * c(1, 0, -1)` + +```{r, echo=FALSE} +m * c(1, 0, -1) +``` + +3. `m > c(0, 20)` + +```{r, echo=FALSE} +m > c(0, 20) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 4 + +We're interested in looking at the sum of the +following sequence of fractions: + +```{r, eval=FALSE} + x = 1/(1^2) + 1/(2^2) + 1/(3^2) + ... + 1/(n^2) +``` + +This would be tedious to type out, and impossible for high values of +n. Use vectorisation to compute x when n=100. What is the sum when +n=10,000? + +::::::::::::::: solution + +## Challenge 4 + +We're interested in looking at the sum of the +following sequence of fractions: + +```{r, eval=FALSE} + x = 1/(1^2) + 1/(2^2) + 1/(3^2) + ... + 1/(n^2) +``` + +This would be tedious to type out, and impossible for +high values of n. +Can you use vectorisation to compute x, when n=100? +How about when n=10,000? 
+ +```{r} +sum(1/(1:100)^2) +sum(1/(1:1e04)^2) +n <- 10000 +sum(1/(1:n)^2) +``` + +We can also obtain the same results using a function: + +```{r} +inverse_sum_of_squares <- function(n) { + sum(1/(1:n)^2) +} +inverse_sum_of_squares(100) +inverse_sum_of_squares(10000) +n <- 10000 +inverse_sum_of_squares(n) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: Operations on vectors of unequal length + +Operations can also be performed on vectors of unequal length, through +a process known as _recycling_. This process automatically repeats the shorter vector +until it matches the length of the longer vector. R will provide a warning +if the length of the longer vector is not a multiple of the length of the shorter vector. + +```{r} +x <- c(1, 2, 3) +y <- c(1, 2, 3, 4, 5, 6, 7) +x + y +``` + +Vector `x` was recycled to match the length of vector `y`: + +```{r, eval=FALSE} +x: 1 2 3 1 2 3 1 + + + + + + + + +y: 1 2 3 4 5 6 7 +----------------------- + 2 4 6 5 7 9 8 +``` + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: keypoints + +- Use vectorized operations instead of loops. + +:::::::::::::::::::::::::::::::::::::::::::::::::: diff --git a/locale/it/episodes/10-functions.Rmd b/locale/it/episodes/10-functions.Rmd new file mode 100644 index 000000000..ba405661f --- /dev/null +++ b/locale/it/episodes/10-functions.Rmd @@ -0,0 +1,590 @@ +--- +title: Functions Explained +teaching: 45 +exercises: 15 +source: Rmd +--- + +::::::::::::::::::::::::::::::::::::::: objectives + +- Define a function that takes arguments. +- Return a value from a function. +- Check argument conditions with `stopifnot()` in functions. +- Test a function. +- Set default values for function arguments. +- Explain why we should divide programs into small, single-purpose functions.
+ +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- How can I write a new function in R? + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +```{r, include=FALSE} +gapminder <- read.csv("data/gapminder_data.csv", header = TRUE) +``` + +If we only had one data set to analyze, it would probably be faster to load the +file into a spreadsheet and use that to plot simple statistics. However, the +gapminder data is updated periodically, and we may want to pull in that new +information later and re-run our analysis again. We may also obtain similar data +from a different source in the future. + +In this lesson, we'll learn how to write a function so that we can repeat +several operations with a single command. + +::::::::::::::::::::::::::::::::::::::::: callout + +## What is a function? + +Functions gather a sequence of operations into a whole, preserving it for +ongoing use. Functions provide: + +- a name we can remember and invoke it by +- relief from the need to remember the individual operations +- a defined set of inputs and expected outputs +- rich connections to the larger programming environment + +As the basic building block of most programming languages, user-defined +functions constitute "programming" as much as any single abstraction can. If +you have written a function, you are a computer programmer. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Defining a function + +Let's open a new R script file in the `functions/` directory and call it +functions-lesson.R. 
+ +The general structure of a function is: + +```{r} +my_function <- function(parameters) { + # perform action + # return value +} +``` + +Let's define a function `fahr_to_kelvin()` that converts temperatures from +Fahrenheit to Kelvin: + +```{r} +fahr_to_kelvin <- function(temp) { + kelvin <- ((temp - 32) * (5 / 9)) + 273.15 + return(kelvin) +} +``` + +We define `fahr_to_kelvin()` by assigning the output of `function` to it. The +list of argument names is contained within parentheses. Next, the +[body](../learners/reference.md#body) of the function--the +statements that are executed when it runs--is contained within curly braces +(`{}`). The statements in the body are indented by two spaces. This makes the +code easier to read but does not affect how the code operates. + +It is useful to think of creating functions like writing a recipe in a cookbook. First you define the "ingredients" that your function needs. In this case, we only need one ingredient to use our function: "temp". After we list our ingredients, we then say what we will do with them. In this case, we take our ingredient and apply a set of mathematical operators to it. + +When we call the function, the values we pass to it as arguments are assigned to +those variables so that we can use them inside the function. Inside the +function, we use a return +statement to send a result back to +whoever asked for it. + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip + +One feature of R, shared with only some other languages, is that the return statement is not required. +R automatically returns the value of the last expression evaluated in the body +of the function. But for clarity, we will explicitly define the +return statement. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +Let's try running our function.
+Calling our own function is no different from calling any other function: + +```{r} +# freezing point of water +fahr_to_kelvin(32) +``` + +```{r} +# boiling point of water +fahr_to_kelvin(212) +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 1 + +Write a function called `kelvin_to_celsius()` that takes a temperature in +Kelvin and returns that temperature in Celsius. + +Hint: To convert from Kelvin to Celsius you subtract 273.15 + +::::::::::::::: solution + +## Solution to challenge 1 + +Write a function called `kelvin_to_celsius` that takes a temperature in Kelvin +and returns that temperature in Celsius + +```{r} +kelvin_to_celsius <- function(temp) { + celsius <- temp - 273.15 + return(celsius) +} +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Combining functions + +The real power of functions comes from mixing, matching and combining them +into ever-larger chunks to get the effect we want. + +Let's define two functions that will convert temperature from Fahrenheit to +Kelvin, and Kelvin to Celsius: + +```{r} +fahr_to_kelvin <- function(temp) { + kelvin <- ((temp - 32) * (5 / 9)) + 273.15 + return(kelvin) +} + +kelvin_to_celsius <- function(temp) { + celsius <- temp - 273.15 + return(celsius) +} +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 2 + +Define the function to convert directly from Fahrenheit to Celsius, +by reusing the two functions above (or using your own functions if you +prefer). 
+ +::::::::::::::: solution + +## Solution to challenge 2 + +Define the function to convert directly from Fahrenheit to Celsius, +by reusing these two functions above + +```{r} +fahr_to_celsius <- function(temp) { + temp_k <- fahr_to_kelvin(temp) + result <- kelvin_to_celsius(temp_k) + return(result) +} +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Interlude: Defensive Programming + +Now that we've begun to appreciate how writing functions provides an efficient +way to make R code re-usable and modular, we should note that it is important +to ensure that functions only work in their intended use-cases. Checking +function parameters is related to the concept of _defensive programming_. +Defensive programming encourages us to frequently check conditions and throw an +error if something is wrong. These checks are referred to as assertion +statements because we want to assert some condition is `TRUE` before proceeding. +They make it easier to debug because they give us a better idea of where the +errors originate. + +### Checking conditions with `stopifnot()` + +Let's start by re-examining `fahr_to_kelvin()`, our function for converting +temperatures from Fahrenheit to Kelvin. It was defined like so: + +```{r} +fahr_to_kelvin <- function(temp) { + kelvin <- ((temp - 32) * (5 / 9)) + 273.15 + return(kelvin) +} +``` + +For this function to work as intended, the argument `temp` must be a `numeric` +value; otherwise, the mathematical procedure for converting between the two +temperature scales will not work. To create an error, we can use the function +`stop()`. For example, since the argument `temp` must be a `numeric` vector, we +could check for this condition with an `if` statement and throw an error if the +condition was violated. 
We could augment our function above like so: + +```{r} +fahr_to_kelvin <- function(temp) { + if (!is.numeric(temp)) { + stop("temp must be a numeric vector.") + } + kelvin <- ((temp - 32) * (5 / 9)) + 273.15 + return(kelvin) +} +``` + +If we had multiple conditions or arguments to check, it would take many lines +of code to check all of them. Luckily R provides the convenience function +`stopifnot()`. We can list as many requirements as we like, each of which should evaluate to `TRUE`; +`stopifnot()` throws an error if it finds one that is `FALSE`. Listing these +conditions also serves a secondary purpose as extra documentation for the +function. + +Let's try out defensive programming with `stopifnot()` by adding assertions to +check the input to our function `fahr_to_kelvin()`. + +We want to assert the following: `temp` is a numeric vector. We may do that like +so: + +```{r} +fahr_to_kelvin <- function(temp) { + stopifnot(is.numeric(temp)) + kelvin <- ((temp - 32) * (5 / 9)) + 273.15 + return(kelvin) +} +``` + +It still works when given proper input. + +```{r} +# freezing point of water +fahr_to_kelvin(temp = 32) +``` + +But it fails immediately if given improper input. + +```{r} +# temp is a factor instead of numeric +fahr_to_kelvin(temp = as.factor(32)) +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 3 + +Use defensive programming to ensure that our `fahr_to_celsius()` function +throws an error immediately if the argument `temp` is specified +inappropriately. + +::::::::::::::: solution + +## Solution to challenge 3 + +Extend our previous definition of the function by adding in an explicit call +to `stopifnot()`. Since `fahr_to_celsius()` is a composition of two other +functions, checking inside here makes adding checks to the two component +functions redundant.
+ +```{r} +fahr_to_celsius <- function(temp) { + stopifnot(is.numeric(temp)) + temp_k <- fahr_to_kelvin(temp) + result <- kelvin_to_celsius(temp_k) + return(result) +} +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## More on combining functions + +Now, we're going to define a function that calculates the Gross Domestic Product +of a nation from the data available in our dataset: + +```{r} +# Takes a dataset and multiplies the population column +# with the GDP per capita column. +calcGDP <- function(dat) { + gdp <- dat$pop * dat$gdpPercap + return(gdp) +} +``` + +We define `calcGDP()` by assigning the output of `function` to it. The list of +argument names is contained within parentheses. Next, the body of the function +\-- the statements executed when you call the function -- is contained within +curly braces (`{}`). + +We've indented the statements in the body by two spaces. This makes the code +easier to read but does not affect how it operates. + +When we call the function, the values we pass to it are assigned to the +arguments, which become variables inside the body of the function. + +Inside the function, we use the `return()` function to send back the result. +This `return()` function is optional: R will automatically return the result of +the last expression evaluated in the body of the function. + +```{r} +calcGDP(head(gapminder)) +``` + +That's not very informative. Let's add some more arguments so we can extract +the GDP per year and country. + +```{r} +# Takes a dataset and multiplies the population column +# with the GDP per capita column.
+calcGDP <- function(dat, year=NULL, country=NULL) { + if(!is.null(year)) { + dat <- dat[dat$year %in% year, ] + } + if (!is.null(country)) { + dat <- dat[dat$country %in% country,] + } + gdp <- dat$pop * dat$gdpPercap + + new <- cbind(dat, gdp=gdp) + return(new) +} +``` + +If you've been writing these functions down into a separate R script +(a good idea!), you can load in the functions into our R session by using the +`source()` function: + +```{r, eval=FALSE} +source("functions/functions-lesson.R") +``` + +Ok, so there's a lot going on in this function now. In plain English, the +function now subsets the provided data by year if the year argument isn't empty, +then subsets the result by country if the country argument isn't empty. Then it +calculates the GDP for whatever subset emerges from the previous two steps. The +function then adds the GDP as a new column to the subsetted data and returns +this as the final result. You can see that the output is much more informative +than a vector of numbers. + +Let's take a look at what happens when we specify the year: + +```{r} +head(calcGDP(gapminder, year=2007)) +``` + +Or for a specific country: + +```{r} +calcGDP(gapminder, country="Australia") +``` + +Or both: + +```{r} +calcGDP(gapminder, year=2007, country="Australia") +``` + +Let's walk through the body of the function: + +```{r, eval=FALSE} +calcGDP <- function(dat, year=NULL, country=NULL) { +``` + +Here we've added two arguments, `year`, and `country`. We've set +_default arguments_ for both as `NULL` using the `=` operator +in the function definition. This means that those arguments will +take on those values unless the user specifies otherwise. 
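
As a minimal sketch of how default arguments behave, consider the hypothetical function `scale_values()` below. It is not part of the lesson code, and its name and arguments are assumptions chosen only for illustration. Its `factor` argument falls back to `1` when the caller does not supply it, and `offset = NULL` is checked with `is.null()` in the same way that `calcGDP()` checks `year` and `country`.

```{r}
# Hypothetical helper, used only to illustrate default arguments
scale_values <- function(x, factor = 1, offset = NULL) {
  result <- x * factor
  # Only apply the offset when the caller supplied one
  if (!is.null(offset)) {
    result <- result + offset
  }
  return(result)
}

scale_values(c(1, 2, 3))                           # both defaults used
scale_values(c(1, 2, 3), factor = 10)              # override one default
scale_values(c(1, 2, 3), factor = 10, offset = 5)  # override both
```

Using `NULL` as the default is a common way to say "this argument is optional", because the body can then test for it explicitly.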
+ +```{r, eval=FALSE} + if(!is.null(year)) { + dat <- dat[dat$year %in% year, ] + } + if (!is.null(country)) { + dat <- dat[dat$country %in% country,] + } +``` + +Here, we check whether each additional argument is set to `NULL`, and whenever +they're not `NULL` overwrite the dataset stored in `dat` with a subset given by +the non-`NULL` argument. + +Building these conditionals into the function makes it more flexible for later. +Now, we can use it to calculate the GDP for: + +- The whole dataset; +- A single year; +- A single country; +- A single combination of year and country. + +By using `%in%` instead of `==`, we can also give multiple years or countries to those +arguments. + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: Pass by value + +Functions in R almost always make copies of the data to operate on +inside of a function body. When we modify `dat` inside the function +we are modifying the copy of the gapminder dataset stored in `dat`, +not the original variable we gave as the first argument. + +This is called "pass-by-value" and it makes writing code much safer: +you can always be sure that whatever changes you make within the +body of the function stay inside the body of the function. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: Function scope + +Another important concept is scoping: any variables (or functions!) you +create or modify inside the body of a function only exist for the lifetime +of the function's execution. When we call `calcGDP()`, the variables `dat`, +`gdp` and `new` only exist inside the body of the function. Even if we +have variables of the same name in our interactive R session, they are +not modified in any way when executing a function.
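
A minimal sketch of this behaviour, using a hypothetical function `double_it()` and a session variable named `gdp`, both chosen only for illustration:

```{r}
gdp <- "defined in the session"

# Hypothetical function: creates its own local variable named gdp
double_it <- function(x) {
  gdp <- x * 2   # local to this call; the session's gdp is untouched
  return(gdp)
}

double_it(21)
gdp   # unchanged after the call
```

The assignment inside `double_it()` creates a new local `gdp`, so the variable in the session keeps its original value.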
+ +:::::::::::::::::::::::::::::::::::::::::::::::::: + +```{r, eval=FALSE} + gdp <- dat$pop * dat$gdpPercap + new <- cbind(dat, gdp=gdp) + return(new) +} +``` + +Finally, we calculated the GDP on our new subset, and created a new data frame +with that column added. This means when we call the function later we can see +the context for the returned GDP values, which is much better than in our first +attempt where we got a vector of numbers. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 4 + +Test out your GDP function by calculating the GDP for New Zealand in 1987. How +does this differ from New Zealand's GDP in 1952? + +::::::::::::::: solution + +## Solution to challenge 4 + +```{r, eval=FALSE} + calcGDP(gapminder, year = c(1952, 1987), country = "New Zealand") +``` + +GDP for New Zealand in 1987: 65050008703 + +GDP for New Zealand in 1952: 21058193787 + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 5 + +The `paste()` function can be used to combine text together, e.g: + +```{r} +best_practice <- c("Write", "programs", "for", "people", "not", "computers") +paste(best_practice, collapse=" ") +``` + +Write a function called `fence()` that takes two vectors as arguments, called +`text` and `wrapper`, and prints out the text wrapped with the `wrapper`: + +```{r, eval=FALSE} +fence(text=best_practice, wrapper="***") +``` + +_Note:_ the `paste()` function has an argument called `sep`, which specifies +the separator between text. The default is a space: " ". The default for +`paste0()` is no space "". 
+ +::::::::::::::: solution + +## Solution to challenge 5 + +Write a function called `fence()` that takes two vectors as arguments, +called `text` and `wrapper`, and prints out the text wrapped with the +`wrapper`: + +```{r} +fence <- function(text, wrapper){ + text <- c(wrapper, text, wrapper) + result <- paste(text, collapse = " ") + return(result) +} +best_practice <- c("Write", "programs", "for", "people", "not", "computers") +fence(text=best_practice, wrapper="***") +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip + +R has some unique aspects that can be exploited when performing more +complicated operations. We will not be writing anything that requires +knowledge of these more advanced concepts. In the future when you are +comfortable writing functions in R, you can learn more by reading the +[R Language Manual][man] or this [chapter] from +[Advanced R Programming][adv-r] by Hadley Wickham. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: Testing and documenting + +It's important to both test functions and document them: +documentation helps you, and others, understand what the +purpose of your function is and how to use it, and testing +makes sure that your function actually does +what you think it does. + +When you first start out, your workflow will probably look a lot +like this: + +1. Write a function +2. Comment parts of the function to document its behaviour +3. Load in the source file +4. Experiment with it in the console to make sure it behaves + as you expect +5. Make any necessary bug fixes +6. Rinse and repeat. + +Formal documentation for functions, written in separate `.Rd` +files, gets turned into the documentation you see in help +files.
The [roxygen2] package allows R coders to write documentation +alongside the function code and then process it into the appropriate `.Rd` +files. You will want to switch to this more formal method of writing +documentation when you start writing more complicated R projects. In fact, +packages are, in essence, bundles of functions with this formal documentation. +Loading your own functions through `source("functions.R")` is equivalent to +loading someone else's functions (or your own one day!) through +`library("package")`. + +Formal automated tests can be written using the [testthat] package. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +[man]: https://cran.r-project.org/doc/manuals/r-release/R-lang.html#Environment-objects +[chapter]: https://adv-r.had.co.nz/Environments.html +[adv-r]: https://adv-r.had.co.nz/ +[roxygen2]: https://cran.r-project.org/web/packages/roxygen2/vignettes/rd.html +[testthat]: https://r-pkgs.had.co.nz/tests.html + +:::::::::::::::::::::::::::::::::::::::: keypoints + +- Use `function` to define a new function in R. +- Use parameters to pass values into functions. +- Use `stopifnot()` to flexibly check function arguments in R. +- Load functions into programs using `source()`. + +:::::::::::::::::::::::::::::::::::::::::::::::::: diff --git a/locale/it/episodes/11-writing-data.Rmd b/locale/it/episodes/11-writing-data.Rmd new file mode 100644 index 000000000..646e11b7e --- /dev/null +++ b/locale/it/episodes/11-writing-data.Rmd @@ -0,0 +1,188 @@ +--- +title: Writing Data +teaching: 10 +exercises: 10 +source: Rmd +--- + +::::::::::::::::::::::::::::::::::::::: objectives + +- To be able to write out plots and data from R. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- How can I save plots and data created in R? 
+ +:::::::::::::::::::::::::::::::::::::::::::::::::: + +```{r, include=FALSE} +library("ggplot2") +gapminder <- read.csv("data/gapminder_data.csv", header = TRUE) +dir.create("cleaned-data") +``` + +## Saving plots + +You have already seen how to save the most recent plot you create in `ggplot2`, +using the command `ggsave`. As a refresher: + +```{r, eval=FALSE} +ggsave("My_most_recent_plot.pdf") +``` + +You can save a plot from within RStudio using the 'Export' button +in the 'Plot' window. This will give you the option of saving as a +.pdf or as .png, .jpg or other image formats. + +Sometimes you will want to save plots without creating them in the +'Plot' window first. Perhaps you want to make a pdf document with +multiple pages: each one a different plot, for example. Or perhaps +you're looping through multiple subsets of a file, plotting data from +each subset, and you want to save each plot, but obviously can't stop +the loop to click 'Export' for each one. + +In this case you can use a more flexible approach. The function +`pdf` creates a new pdf device. You can control the size and resolution +using the arguments to this function. + +```{r, eval=FALSE} +pdf("Life_Exp_vs_time.pdf", width=12, height=4) +ggplot(data=gapminder, aes(x=year, y=lifeExp, colour=country)) + + geom_line() + + theme(legend.position = "none") + +# You then have to make sure to turn off the pdf device! + +dev.off() +``` + +Open up this document and have a look. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 1 + +Rewrite your 'pdf' command to print a second +page in the pdf, showing a facet plot (hint: use `facet_grid`) +of the same data with one panel per continent. 
+ +::::::::::::::: solution + +## Solution to challenge 1 + +```{r, eval=FALSE} +pdf("Life_Exp_vs_time.pdf", width = 12, height = 4) +p <- ggplot(data = gapminder, aes(x = year, y = lifeExp, colour = country)) + + geom_line() + + theme(legend.position = "none") +p +p + facet_grid(~continent) +dev.off() +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +The commands `jpeg`, `png` etc. are used similarly to produce +documents in different formats. + +## Writing data + +At some point, you'll also want to write out data from R. + +We can use the `write.table` function for this, which is +very similar to `read.table` from before. + +Let's create a data-cleaning script. For this analysis, we +only want to focus on the gapminder data for Australia: + +```{r} +aust_subset <- gapminder[gapminder$country == "Australia",] + +write.table(aust_subset, + file="cleaned-data/gapminder-aus.csv", + sep="," +) +``` + +Let's switch back to the shell to take a look at the data to make sure it looks +OK: + +```{r, engine="bash"} +head cleaned-data/gapminder-aus.csv +``` + +Hmm, that's not quite what we wanted. Where did all these +quotation marks come from? Also the row numbers are +meaningless. + +Let's look at the help file to work out how to change this +behaviour. + +```{r, eval=FALSE} +?write.table +``` + +By default R will wrap character vectors with quotation marks +when writing out to file. It will also write out the row and +column names. + +Let's fix this: + +```{r} +write.table( + gapminder[gapminder$country == "Australia",], + file="cleaned-data/gapminder-aus.csv", + sep=",", quote=FALSE, row.names=FALSE +) +``` + +Now let's look at the data again using our shell skills: + +```{r, engine="bash"} +head cleaned-data/gapminder-aus.csv +``` + +That looks better!
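
To compare the two calls without touching the lesson files, here is a small sketch that writes a made-up data frame to temporary files and reads the raw lines back. The names `toy`, `default_file` and `clean_file` are assumptions for illustration only.

```{r}
# Made-up data, not taken from the gapminder file
toy <- data.frame(country = c("Australia", "New Zealand"),
                  year = c(2007, 2007),
                  stringsAsFactors = FALSE)

default_file <- tempfile(fileext = ".csv")
clean_file <- tempfile(fileext = ".csv")

# Default behaviour: quoted strings and row names
write.table(toy, file = default_file, sep = ",")
# Without quotes and row names
write.table(toy, file = clean_file, sep = ",", quote = FALSE, row.names = FALSE)

readLines(default_file)
readLines(clean_file)
```

Reading the files back with `readLines()` plays the same role here as `head` does in the shell.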
+ +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 2 + +Write a data-cleaning script file that subsets the gapminder +data to include only data points collected since 1990. + +Use this script to write out the new subset to a file +in the `cleaned-data/` directory. + +::::::::::::::: solution + +## Solution to challenge 2 + +```{r, eval=FALSE} +write.table( + gapminder[gapminder$year > 1990, ], + file = "cleaned-data/gapminder-after1990.csv", + sep = ",", quote = FALSE, row.names = FALSE +) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +```{r, echo=FALSE} +# We remove after rendering the lesson, because we don't want this in the lesson +# repository +unlink("cleaned-data", recursive=TRUE) +``` + +:::::::::::::::::::::::::::::::::::::::: keypoints + +- Save plots from RStudio using the 'Export' button. +- Use `write.table` to save tabular data. + +:::::::::::::::::::::::::::::::::::::::::::::::::: diff --git a/locale/it/episodes/12-dplyr.Rmd b/locale/it/episodes/12-dplyr.Rmd new file mode 100644 index 000000000..0f5540883 --- /dev/null +++ b/locale/it/episodes/12-dplyr.Rmd @@ -0,0 +1,487 @@ +--- +title: Data Frame Manipulation with dplyr +teaching: 40 +exercises: 15 +source: Rmd +--- + +::::::::::::::::::::::::::::::::::::::: objectives + +- To be able to use the six main data frame manipulation 'verbs' with pipes in `dplyr`. +- To understand how `group_by()` and `summarize()` can be combined to summarize datasets. +- Be able to analyze a subset of data using logical filtering. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- How can I manipulate data frames without repeating myself? 
+ +:::::::::::::::::::::::::::::::::::::::::::::::::: + +```{r, include=FALSE} +gapminder <- read.csv("data/gapminder_data.csv", header = TRUE) +``` + +Manipulation of data frames means many things to many researchers: we often +select certain observations (rows) or variables (columns), we often group the +data by a certain variable(s), or we even calculate summary statistics. We can +do these operations using the normal base R operations: + +```{r} +mean(gapminder$gdpPercap[gapminder$continent == "Africa"]) +mean(gapminder$gdpPercap[gapminder$continent == "Americas"]) +mean(gapminder$gdpPercap[gapminder$continent == "Asia"]) +``` + +But this isn't very _nice_ because there is a fair bit of repetition. Repeating +yourself will cost you time, both now and later, and potentially introduce some +nasty bugs. + +## The `dplyr` package + +Luckily, the [`dplyr`](https://cran.r-project.org/package=dplyr) +package provides a number of very useful functions for manipulating data frames +in a way that will reduce the above repetition, reduce the probability of making +errors, and probably even save you some typing. As an added bonus, you might +even find the `dplyr` grammar easier to read. + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: Tidyverse + +`dplyr` package belongs to a broader family of opinionated R packages +designed for data science called the "Tidyverse". These +packages are specifically designed to work harmoniously together. +Some of these packages will be covered along this course, but you can find more +complete information here: [https://www.tidyverse.org/](https://www.tidyverse.org/). + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +Here we're going to cover 5 of the most commonly used functions as well as using +pipes (`%>%`) to combine them. + +1. `select()` +2. `filter()` +3. `group_by()` +4. `summarize()` +5. 
`mutate()` + +If you have not installed this package earlier, please do so: + +```{r, eval=FALSE} +install.packages('dplyr') +``` + +Now let's load the package: + +```{r, message=FALSE} +library("dplyr") +``` + +## Using select() + +If, for example, we wanted to move forward with only a few of the variables in +our data frame we could use the `select()` function. This will keep only the +variables you select. + +```{r} +year_country_gdp <- select(gapminder, year, country, gdpPercap) +``` + +![](fig/13-dplyr-fig1.png){alt='Diagram illustrating use of select function to select two columns of a data frame'} +If we want to remove only one column from the `gapminder` data, for example +the `continent` column, we can prefix its name with a minus sign. + +```{r} +smaller_gapminder_data <- select(gapminder, -continent) +``` + +If we open up `year_country_gdp` we'll see that it only contains the year, +country and gdpPercap. Above we used 'normal' grammar, but the strengths of +`dplyr` lie in combining several functions using pipes. Since the pipes grammar +is unlike anything we've seen in R before, let's repeat what we've done above +using pipes. + +```{r} +year_country_gdp <- gapminder %>% select(year, country, gdpPercap) +``` + +To help you understand why we wrote that in that way, let's walk through it step +by step. First we summon the gapminder data frame and pass it on, using the pipe +symbol `%>%`, to the next step, which is the `select()` function. In this case +we don't specify which data object we use in the `select()` function since it +gets that from the previous pipe. **Fun Fact**: There is a good chance you have +encountered pipes before in the shell. In R, a pipe symbol is `%>%` while in the +shell it is `|` but the concept is the same! + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: Renaming data frame columns in dplyr + +In Chapter 4 we covered how you can rename columns with base R by assigning a value to the output of the `names()` function.
+Just like select, this is a bit cumbersome, but thankfully dplyr has a `rename()` function. + +Within a pipeline, the syntax is `rename(new_name = old_name)`. +For example, we may want to rename the gdpPercap column name from our `select()` statement above. + +```{r} +tidy_gdp <- year_country_gdp %>% rename(gdp_per_capita = gdpPercap) + +head(tidy_gdp) +``` + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Using filter() + +If we now want to move forward with the above, but only with European +countries, we can combine `select` and `filter` + +```{r} +year_country_gdp_euro <- gapminder %>% + filter(continent == "Europe") %>% + select(year, country, gdpPercap) +``` + +If we now want to show life expectancy of European countries but only +for a specific year (e.g., 2007), we can do as below. + +```{r} +europe_lifeExp_2007 <- gapminder %>% + filter(continent == "Europe", year == 2007) %>% + select(country, lifeExp) +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 1 + +Write a single command (which can span multiple lines and includes pipes) that +will produce a data frame that has the African values for `lifeExp`, `country` +and `year`, but not for other Continents. How many rows does your data frame +have and why? + +::::::::::::::: solution + +## Solution to Challenge 1 + +```{r} +year_country_lifeExp_Africa <- gapminder %>% + filter(continent == "Africa") %>% + select(year, country, lifeExp) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +As with last time, first we pass the gapminder data frame to the `filter()` +function, then we pass the filtered version of the gapminder data frame to the +`select()` function. **Note:** The order of operations is very important in this +case. If we used 'select' first, filter would not be able to find the variable +continent since we would have removed it in the previous step. 
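
The effect of getting the order wrong can be sketched with a small made-up data frame (`toy` and its values are assumptions for illustration, not lesson data). The second pipeline fails because `continent` no longer exists when `filter()` runs, so the error is caught with `tryCatch()` to keep the example running:

```{r}
library("dplyr")

# Made-up data, used only for this illustration
toy <- data.frame(country = c("Albania", "Algeria"),
                  continent = c("Europe", "Africa"),
                  lifeExp = c(76.4, 72.3))

# filter() first, then select(): works
right_order <- toy %>%
  filter(continent == "Europe") %>%
  select(country, lifeExp)
right_order

# select() first drops continent, so filter() cannot find it
wrong_order <- tryCatch(
  toy %>%
    select(country, lifeExp) %>%
    filter(continent == "Europe"),
  error = function(e) "filter() failed: continent was already removed"
)
wrong_order
```

This sketch assumes there is no separate variable called `continent` in your R session; if there were, `filter()` could pick that up instead of raising an error.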
+ +## Using group\_by() + +Now, we were supposed to be reducing the error-prone repetitiveness of what can +be done with base R, but up to now we haven't done that since we would have to +repeat the above for each continent. Instead of `filter()`, which will only pass +observations that meet your criteria (in the above: `continent=="Europe"`), we +can use `group_by()`, which will essentially use every unique criterion that you +could have used in filter. + +```{r} +str(gapminder) + +str(gapminder %>% group_by(continent)) +``` + +You will notice that the structure of the data frame where we used `group_by()` +(`grouped_df`) is not the same as the original `gapminder` (`data.frame`). A +`grouped_df` can be thought of as a `list` where each item in the `list` is a +`data.frame` which contains only the rows that correspond to a particular +value of `continent` (at least in the example above). + +![](fig/13-dplyr-fig2.png){alt='Diagram illustrating how the group by function organizes a data frame into groups'} + +## Using summarize() + +The above was a bit on the uneventful side but `group_by()` is much more +exciting in conjunction with `summarize()`. This will allow us to create new +variable(s) by using functions that repeat for each of the continent-specific +data frames. That is to say, using the `group_by()` function, we split our +original data frame into multiple pieces, then we can run functions +(e.g. `mean()` or `sd()`) within `summarize()`. + +```{r} +gdp_bycontinents <- gapminder %>% + group_by(continent) %>% + summarize(mean_gdpPercap = mean(gdpPercap)) +``` + +![](fig/13-dplyr-fig3.png){alt='Diagram illustrating the use of group by and summarize together to create a new variable'} + +```{r, eval=FALSE} +continent mean_gdpPercap + +1 Africa 2193.755 +2 Americas 7136.110 +3 Asia 7902.150 +4 Europe 14469.476 +5 Oceania 18621.609 +``` + +That allowed us to calculate the mean gdpPercap for each continent, but it gets +even better.
+ +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 2 + +Calculate the average life expectancy per country. Which has the longest average life +expectancy and which has the shortest average life expectancy? + +::::::::::::::: solution + +## Solution to Challenge 2 + +```{r} +lifeExp_bycountry <- gapminder %>% + group_by(country) %>% + summarize(mean_lifeExp = mean(lifeExp)) +lifeExp_bycountry %>% + filter(mean_lifeExp == min(mean_lifeExp) | mean_lifeExp == max(mean_lifeExp)) +``` + +Another way to do this is to use the `dplyr` function `arrange()`, which +arranges the rows in a data frame according to the order of one or more +variables from the data frame. It has similar syntax to other functions from +the `dplyr` package. You can use `desc()` inside `arrange()` to sort in +descending order. + +```{r} +lifeExp_bycountry %>% + arrange(mean_lifeExp) %>% + head(1) +lifeExp_bycountry %>% + arrange(desc(mean_lifeExp)) %>% + head(1) +``` + +Alphabetical order works too + +```{r} +lifeExp_bycountry %>% + arrange(desc(country)) %>% + head(1) +``` + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::: + +The function `group_by()` allows us to group by multiple variables. Let's group by `year` and `continent`. + +```{r} +gdp_bycontinents_byyear <- gapminder %>% + group_by(continent, year) %>% + summarize(mean_gdpPercap = mean(gdpPercap)) +``` + +That is already quite powerful, but it gets even better! You're not limited to defining 1 new variable in `summarize()`. + +```{r} +gdp_pop_bycontinents_byyear <- gapminder %>% + group_by(continent, year) %>% + summarize(mean_gdpPercap = mean(gdpPercap), + sd_gdpPercap = sd(gdpPercap), + mean_pop = mean(pop), + sd_pop = sd(pop)) +``` + +## count() and n() + +A very common operation is to count the number of observations for each +group. The `dplyr` package comes with two related functions that help with this. 
+ +For instance, if we want to check the number of countries included in the +dataset for the year 2002, we can use the `count()` function. It takes the name +of one or more columns that contain the groups we are interested in, and we can +optionally sort the results in descending order by adding `sort=TRUE`: + +```{r} +gapminder %>% + filter(year == 2002) %>% + count(continent, sort = TRUE) +``` + +If we need to use the number of observations in calculations, the `n()` function +is useful. It will return the total number of observations in the current group rather than counting the number of observations in each group within a specific column. For instance, to get the standard error of the life expectancy per continent: + +```{r} +gapminder %>% + group_by(continent) %>% + summarize(se_le = sd(lifeExp)/sqrt(n())) +``` + +You can also chain together several summary operations; in this case calculating the `minimum`, `maximum`, `mean` and `se` of each continent's per-country life-expectancy: + +```{r} +gapminder %>% + group_by(continent) %>% + summarize( + mean_le = mean(lifeExp), + min_le = min(lifeExp), + max_le = max(lifeExp), + se_le = sd(lifeExp)/sqrt(n())) +``` + +## Using mutate() + +We can also create new variables prior to (or even after) summarizing information using `mutate()`. + +```{r} +gdp_pop_bycontinents_byyear <- gapminder %>% + mutate(gdp_billion = gdpPercap*pop/10^9) %>% + group_by(continent,year) %>% + summarize(mean_gdpPercap = mean(gdpPercap), + sd_gdpPercap = sd(gdpPercap), + mean_pop = mean(pop), + sd_pop = sd(pop), + mean_gdp_billion = mean(gdp_billion), + sd_gdp_billion = sd(gdp_billion)) +``` + +## Connect mutate with logical filtering: ifelse + +When creating new variables, we can combine this with a logical condition. A simple combination of +`mutate()` and `ifelse()` facilitates filtering right where it is needed: in the moment of creating something new.
+This easy-to-read statement is a fast and powerful way of discarding certain data (even though the overall dimension +of the data frame will not change) or for updating values depending on this given condition. + +```{r} +## keeping all data but "filtering" on a certain condition +# calculate GDP only for observations with a life expectancy above 25 +gdp_pop_bycontinents_byyear_above25 <- gapminder %>% + mutate(gdp_billion = ifelse(lifeExp > 25, gdpPercap * pop / 10^9, NA)) %>% + group_by(continent, year) %>% + summarize(mean_gdpPercap = mean(gdpPercap), + sd_gdpPercap = sd(gdpPercap), + mean_pop = mean(pop), + sd_pop = sd(pop), + mean_gdp_billion = mean(gdp_billion), + sd_gdp_billion = sd(gdp_billion)) + +## updating only if a certain condition is fulfilled +# for life expectancies above 40 years, the GDP to be expected in the future is scaled +gdp_future_bycontinents_byyear_high_lifeExp <- gapminder %>% + mutate(gdp_futureExpectation = ifelse(lifeExp > 40, gdpPercap * 1.5, gdpPercap)) %>% + group_by(continent, year) %>% + summarize(mean_gdpPercap = mean(gdpPercap), + mean_gdpPercap_expected = mean(gdp_futureExpectation)) +``` + +## Combining `dplyr` and `ggplot2` + +First install and load ggplot2: + +```{r, eval=FALSE} +install.packages('ggplot2') +``` + +```{r, message=FALSE} +library("ggplot2") +``` + +In the plotting lesson we looked at how to make a multi-panel figure by adding +a layer of facet panels using `ggplot2`. Here is the code we used (with some +extra comments): + +```{r} +# Filter countries located in the Americas +americas <- gapminder[gapminder$continent == "Americas", ] +# Make the plot +ggplot(data = americas, mapping = aes(x = year, y = lifeExp)) + + geom_line() + + facet_wrap( ~ country) + + theme(axis.text.x = element_text(angle = 45)) +``` + +This code makes the right plot but it also creates an intermediate variable +(`americas`) that we might not have any other uses for.
Just as we used +`%>%` to pipe data along a chain of `dplyr` functions we can use it to pass data +to `ggplot()`. Because `%>%` replaces the first argument in a function we don't +need to specify the `data =` argument in the `ggplot()` function. By combining +`dplyr` and `ggplot2` functions we can make the same figure without creating any +new variables or modifying the data. + +```{r} +gapminder %>% + # Filter countries located in the Americas + filter(continent == "Americas") %>% + # Make the plot + ggplot(mapping = aes(x = year, y = lifeExp)) + + geom_line() + + facet_wrap( ~ country) + + theme(axis.text.x = element_text(angle = 45)) +``` + +More examples of using the function `mutate()` and the `ggplot2` package. + +```{r} +gapminder %>% + # extract first letter of country name into new column + mutate(startsWith = substr(country, 1, 1)) %>% + # only keep countries starting with A or Z + filter(startsWith %in% c("A", "Z")) %>% + # plot lifeExp into facets + ggplot(aes(x = year, y = lifeExp, colour = continent)) + + geom_line() + + facet_wrap(vars(country)) + + theme_minimal() +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Advanced Challenge + +Calculate the average life expectancy in 2002 of 2 randomly selected countries +for each continent. Then arrange the continent names in reverse order. +**Hint:** Use the `dplyr` functions `arrange()` and `sample_n()`, they have +similar syntax to other dplyr functions. 
+ +::::::::::::::: solution + +## Solution to Advanced Challenge + +```{r} +lifeExp_2countries_bycontinents <- gapminder %>% + filter(year==2002) %>% + group_by(continent) %>% + sample_n(2) %>% + summarize(mean_lifeExp=mean(lifeExp)) %>% + arrange(desc(continent)) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Other great resources + +- [R for Data Science](https://r4ds.hadley.nz/) (online book) +- [Data Wrangling Cheat sheet](https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf) (pdf file) +- [Introduction to dplyr](https://dplyr.tidyverse.org/) (online documentation) +- [Data wrangling with R and RStudio](https://www.rstudio.com/resources/webinars/data-wrangling-with-r-and-rstudio/) (online video) +- [Tidyverse Skills for Data Science](https://jhudatascience.org/tidyversecourse/) (online book) + +:::::::::::::::::::::::::::::::::::::::: keypoints + +- Use the `dplyr` package to manipulate data frames. +- Use `select()` to choose variables from a data frame. +- Use `filter()` to choose data based on values. +- Use `group_by()` and `summarize()` to work with subsets of data. +- Use `mutate()` to create new variables. + +:::::::::::::::::::::::::::::::::::::::::::::::::: diff --git a/locale/it/episodes/13-tidyr.Rmd b/locale/it/episodes/13-tidyr.Rmd new file mode 100644 index 000000000..96e59d18d --- /dev/null +++ b/locale/it/episodes/13-tidyr.Rmd @@ -0,0 +1,321 @@ +--- +title: Data Frame Manipulation with tidyr +teaching: 30 +exercises: 15 +source: Rmd +--- + +::::::::::::::::::::::::::::::::::::::: objectives + +- To understand the concepts of 'longer' and 'wider' data frame formats and be able to convert between them with `tidyr`. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- How can I change the layout of a data frame?
+ +:::::::::::::::::::::::::::::::::::::::::::::::::: + +```{r, include=FALSE} +gapminder <- read.csv("data/gapminder_data.csv", header = TRUE, stringsAsFactors = FALSE) +gap_wide <- read.csv("data/gapminder_wide.csv", header = TRUE, stringsAsFactors = FALSE) +``` + +Researchers often want to reshape their data frames from 'wide' to 'longer' +layouts, or vice-versa. The 'long' layout or format is where: + +- each column is a variable +- each row is an observation + +In the purely 'long' (or 'longest') format, you usually have 1 column for the observed variable and the other columns are ID variables. + +For the 'wide' format each row is often a site/subject/patient and you have +multiple observation variables containing the same type of data. These can be +either repeated observations over time, or observation of multiple variables (or +a mix of both). You may find data input may be simpler or some other +applications may prefer the 'wide' format. However, many of `R`'s functions have +been designed assuming you have 'longer' formatted data. This tutorial will help you +efficiently transform your data shape regardless of original format. + +![](fig/14-tidyr-fig1.png){alt='Diagram illustrating the difference between a wide versus long layout of a data frame'} + +Long and wide data frame layouts mainly affect readability. For humans, the wide format is often more intuitive since we can often see more of the data on the screen due +to its shape. However, the long format is more machine readable and is closer +to the formatting of databases. The ID variables in our data frames are similar to +the fields in a database and observed variables are like the database values. 
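To make the two layouts concrete, here is a small hand-built illustration in base R. The site names and temperature values are hypothetical and exist only to show the same four observations arranged in a wide and in a long layout.

```r
# Wide layout: one row per site, one column per year (made-up values)
wide <- data.frame(
  site      = c("A", "B"),
  temp_2020 = c(11.2, 14.8),
  temp_2021 = c(11.9, 15.1)
)

# Long layout: one row per site-year observation, with `site` and `year`
# acting as ID variables and `temp` as the single observed variable
long <- data.frame(
  site = c("A", "A", "B", "B"),
  year = c(2020, 2021, 2020, 2021),
  temp = c(11.2, 11.9, 14.8, 15.1)
)

dim(wide)  # 2 rows, 3 columns
dim(long)  # 4 rows, 3 columns

# Both layouts hold the same four measurements
same_values <- all(sort(c(wide$temp_2020, wide$temp_2021)) == sort(long$temp))
print(same_values)
```

The wide version is easier to scan by eye, while the long version keeps the year as data instead of hiding it in the column names.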
+ +## Getting started + +First install the packages if you haven't already done so (you probably +installed dplyr in the previous lesson): + +```{r, eval=FALSE} +#install.packages("tidyr") +#install.packages("dplyr") +``` + +Load the packages: + +```{r, message=FALSE} +library("tidyr") +library("dplyr") +``` + +First, let's look at the structure of our original gapminder data frame: + +```{r} +str(gapminder) +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 1 + +Is gapminder a purely long, purely wide, or some intermediate format? + +::::::::::::::: solution + +## Solution to Challenge 1 + +The original gapminder data.frame is in an intermediate format. It is not +purely long since it has multiple observation variables +(`pop`,`lifeExp`,`gdpPercap`). + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +Sometimes, as with the gapminder dataset, we have multiple types of observed +data. It is somewhere in between the purely 'long' and 'wide' data formats. We +have 3 "ID variables" (`continent`, `country`, `year`) and 3 "Observation +variables" (`pop`,`lifeExp`,`gdpPercap`). This intermediate format can be +preferred despite not having ALL observations in 1 column given that all 3 +observation variables have different units. There are few operations that would +need us to make this data frame any longer (i.e. 4 ID variables and 1 +Observation variable). + +While using many of the functions in R, which are often vector based, you +usually do not want to do mathematical operations on values with different +units. For example, using the purely long format, a single mean for all of the +values of population, life expectancy, and GDP would not be meaningful since it +would return the mean of values with 3 incompatible units. The solution is that +we first manipulate the data either by grouping (see the lesson on `dplyr`), or +we change the structure of the data frame.
**Note:** Some plotting functions in +R actually work better in the wide format data. + +## From wide to long format with pivot\_longer() + +Until now, we've been using the nicely formatted original gapminder dataset, but +'real' data (i.e. our own research data) will never be so well organized. Here +let's start with the wide formatted version of the gapminder dataset. + +> Download the wide version of the gapminder data from [this link to a csv file](data/gapminder_wide.csv) +> and save it in your data folder. + +We'll load the data file and look at it. Note: we don't want our continent and +country columns to be factors, so we use the stringsAsFactors argument for +`read.csv()` to disable that. + +```{r} +gap_wide <- read.csv("data/gapminder_wide.csv", stringsAsFactors = FALSE) +str(gap_wide) +``` + +![](fig/14-tidyr-fig2.png){alt='Diagram illustrating the wide format of the gapminder data frame'} + +To change this very wide data frame layout back to our nice, intermediate (or longer) layout, we will use one of the two available `pivot` functions from the `tidyr` package. To convert from wide to a longer format, we will use the `pivot_longer()` function. `pivot_longer()` makes datasets longer by increasing the number of rows and decreasing the number of columns, or 'lengthening' your observation variables into a single variable. + +![](fig/14-tidyr-fig3.png){alt='Diagram illustrating how pivot longer reorganizes a data frame from a wide to long format'} + +```{r} +gap_long <- gap_wide %>% + pivot_longer( + cols = c(starts_with('pop'), starts_with('lifeExp'), starts_with('gdpPercap')), + names_to = "obstype_year", values_to = "obs_values" + ) +str(gap_long) +``` + +Here we have used piping syntax which is similar to what we were doing in the +previous lesson with dplyr. In fact, these are compatible and you can use a mix +of tidyr and dplyr functions by piping them together. 
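As a quick illustration of mixing the two packages, the sketch below pipes a `tidyr` step straight into a `dplyr` step. It assumes both packages are installed; `toy_wide` and its values are hypothetical and stand in for the much larger `gap_wide`.

```r
library(tidyr)
library(dplyr)

# Hypothetical wide toy data (made-up values)
toy_wide <- data.frame(
  country  = c("A", "B"),
  pop_2002 = c(10, 50),
  pop_2007 = c(12, 55)
)

# tidyr's pivot_longer() followed directly by dplyr's filter() in one pipeline
big_pops <- toy_wide %>%
  pivot_longer(cols = starts_with("pop"),
               names_to = "obstype_year", values_to = "obs_values") %>%
  filter(obs_values > 11)

print(big_pops)
```

Because each function takes a data frame as its first argument and returns one, the steps can be chained in whichever order the analysis needs.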
+ +We first provide to `pivot_longer()` a vector of column names that will be +pivoted into longer format. We could type out all the observation variables, but +as in the `select()` function (see `dplyr` lesson), we can use the `starts_with()` +helper to select all variables that start with the desired character string. +`pivot_longer()` also allows the alternative syntax of using the `-` symbol to +identify which variables are not to be pivoted (i.e. ID variables). + +The next arguments to `pivot_longer()` are `names_to` for naming the column that +will contain the new ID variable (`obstype_year`) and `values_to` for naming the +new amalgamated observation variable (`obs_values`). We supply these new column +names as strings. + +![](fig/14-tidyr-fig4.png){alt='Diagram illustrating the long format of the gapminder data'} + +```{r} +gap_long <- gap_wide %>% + pivot_longer( + cols = c(-continent, -country), + names_to = "obstype_year", values_to = "obs_values" + ) +str(gap_long) +``` + +That may seem trivial with this particular data frame, but sometimes you have 1 +ID variable and 40 observation variables with irregular variable names. The +flexibility is a huge time saver! + +Now `obstype_year` actually contains 2 pieces of information, the observation +type (`pop`,`lifeExp`, or `gdpPercap`) and the `year`. We can use the +`separate()` function to split the character strings into multiple variables. + +```{r} +gap_long <- gap_long %>% separate(obstype_year, into = c('obs_type', 'year'), sep = "_") +gap_long$year <- as.integer(gap_long$year) +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 2 + +Using `gap_long`, calculate the mean life expectancy, population, and gdpPercap for each continent.
+**Hint:** use the `group_by()` and `summarize()` functions we learned in the `dplyr` lesson + +::::::::::::::: solution + +## Solution to Challenge 2 + +```{r} +gap_long %>% group_by(continent, obs_type) %>% + summarize(means=mean(obs_values)) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## From long to intermediate format with pivot\_wider() + +It is always good to check work. So, let's use the second `pivot` function, `pivot_wider()`, to 'widen' our observation variables back out. `pivot_wider()` is the opposite of `pivot_longer()`, making a dataset wider by increasing the number of columns and decreasing the number of rows. We can use `pivot_wider()` to pivot or reshape our `gap_long` to the original intermediate format or the widest format. Let's start with the intermediate format. + +The `pivot_wider()` function takes `names_from` and `values_from` arguments. + +To `names_from` we supply the column name whose contents will be pivoted into new +output columns in the widened data frame. The corresponding values will be added +from the column named in the `values_from` argument. + +```{r} +gap_normal <- gap_long %>% + pivot_wider(names_from = obs_type, values_from = obs_values) +dim(gap_normal) +dim(gapminder) +names(gap_normal) +names(gapminder) +``` + +Now we've got an intermediate data frame `gap_normal` with the same dimensions as +the original `gapminder`, but the order of the variables is different. Let's fix +that before checking if they are `all.equal()`. + +```{r} +gap_normal <- gap_normal[, names(gapminder)] +all.equal(gap_normal, gapminder) +head(gap_normal) +head(gapminder) +``` + +We're almost there, the original was sorted by `country`, then +`year`. + +```{r} +gap_normal <- gap_normal %>% arrange(country, year) +all.equal(gap_normal, gapminder) +``` + +That's great! We've gone from the longest format back to the intermediate and we +didn't introduce any errors in our code. 
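The same round trip can be rehearsed on a tiny data frame, which makes each step easy to inspect. This is a sketch under stated assumptions: `tidyr` and `dplyr` are installed, and `toy` with its values is hypothetical. Because `pivot_wider()` returns a tibble, the result is converted with `as.data.frame()` before the comparison.

```r
library(tidyr)
library(dplyr)

# Hypothetical toy data in an intermediate layout (made-up values)
toy <- data.frame(
  country = c("A", "A", "B", "B"),
  year    = c(2002L, 2007L, 2002L, 2007L),
  pop     = c(10, 12, 50, 55),
  lifeExp = c(70.1, 71.3, 65.4, 66.0)
)

# Lengthen: one row per country, year and observation type
toy_long <- toy %>%
  pivot_longer(cols = c(pop, lifeExp),
               names_to = "obs_type", values_to = "obs_values")

# Widen again and restore the original row order
toy_back <- toy_long %>%
  pivot_wider(names_from = obs_type, values_from = obs_values) %>%
  arrange(country, year)

# Compare the round-tripped data with the starting point
round_trip_ok <- all.equal(as.data.frame(toy_back), toy)
print(round_trip_ok)
```

Checking a reshaping pipeline on a few rows first is a cheap way to catch a wrong `names_from` or `values_from` column before running it on the full data set.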
+ +Now let's convert the long format all the way back to the wide format. In the wide format, we +will keep country and continent as ID variables and pivot the observations +across the 3 metrics (`pop`,`lifeExp`,`gdpPercap`) and time (`year`). First we +need to create appropriate labels for all our new variables (time\*metric +combinations) and we also need to unify our ID variables to simplify the process +of defining `gap_wide`. + +```{r} +gap_temp <- gap_long %>% unite(var_ID, continent, country, sep = "_") +str(gap_temp) + +gap_temp <- gap_long %>% + unite(ID_var, continent, country, sep = "_") %>% + unite(var_names, obs_type, year, sep = "_") +str(gap_temp) +``` + +Using `unite()` we now have a single ID variable which is a combination of +`continent` and `country`, and we have defined variable names. We're now ready to +pipe in `pivot_wider()`. + +```{r} +gap_wide_new <- gap_long %>% + unite(ID_var, continent, country, sep = "_") %>% + unite(var_names, obs_type, year, sep = "_") %>% + pivot_wider(names_from = var_names, values_from = obs_values) +str(gap_wide_new) +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 3 + +Take this 1 step further and create a `gap_ludicrously_wide` format data frame by pivoting over countries, year and the 3 metrics. +**Hint:** this new data frame should only have 5 rows.
+ +::::::::::::::: solution + +## Solution to Challenge 3 + +```{r} +gap_ludicrously_wide <- gap_long %>% + unite(var_names, obs_type, year, country, sep = "_") %>% + pivot_wider(names_from = var_names, values_from = obs_values) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +Now we have a great 'wide' format data frame, but the `ID_var` could be more +usable, let's separate it into 2 variables with `separate()` + +```{r} +gap_wide_betterID <- separate(gap_wide_new, ID_var, c("continent", "country"), sep="_") +gap_wide_betterID <- gap_long %>% + unite(ID_var, continent, country, sep = "_") %>% + unite(var_names, obs_type, year, sep = "_") %>% + pivot_wider(names_from = var_names, values_from = obs_values) %>% + separate(ID_var, c("continent","country"), sep = "_") +str(gap_wide_betterID) + +all.equal(gap_wide, gap_wide_betterID) +``` + +There and back again! + +## Other great resources + +- [R for Data Science](https://r4ds.hadley.nz/) (online book) +- [Data Wrangling Cheat sheet](https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf) (pdf file) +- [Introduction to tidyr](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html) (online documentation) +- [Data wrangling with R and RStudio](https://www.rstudio.com/resources/webinars/data-wrangling-with-r-and-rstudio/) (online video) + +:::::::::::::::::::::::::::::::::::::::: keypoints + +- Use the `tidyr` package to change the layout of data frames. +- Use `pivot_longer()` to go from wide to longer layout. +- Use `pivot_wider()` to go from long to wider layout. 
+ +:::::::::::::::::::::::::::::::::::::::::::::::::: diff --git a/locale/it/episodes/14-knitr-markdown.Rmd b/locale/it/episodes/14-knitr-markdown.Rmd new file mode 100644 index 000000000..5829180aa --- /dev/null +++ b/locale/it/episodes/14-knitr-markdown.Rmd @@ -0,0 +1,493 @@ +--- +title: Producing Reports With knitr +teaching: 60 +exercises: 15 +source: Rmd +--- + +::::::::::::::::::::::::::::::::::::::: objectives + +- Understand the value of writing reproducible reports +- Learn how to recognise and compile the basic components of an R Markdown file +- Become familiar with R code chunks, and understand their purpose, structure and options +- Demonstrate the use of inline chunks for weaving R outputs into text blocks, for example when discussing the results of some calculations +- Be aware of alternative output formats to which an R Markdown file can be exported + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- How can I integrate software and reports? + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +```{r chunk_options, include=FALSE} +``` + +## Data analysis reports + +Data analysts tend to write a lot of reports, describing their +analyses and results, for their collaborators or to document their +work for future reference. + +Many new users begin by first writing a single R script containing all of their +work, and then share the analysis by emailing the script and various graphs +as attachments. But this can be cumbersome, requiring a lengthy discussion to +explain which attachment was which result. + +Writing formal reports with Word or [LaTeX](https://www.latex-project.org/) +can simplify this process by incorporating both the analysis report and output graphs +into a single document. 
But tweaking formatting to make figures look correct +and fixing obnoxious page breaks can be tedious and lead to a lengthy "whack-a-mole" +game of fixing new mistakes resulting from a single formatting change. + +Creating a report as a web page (which is an html file) using R Markdown makes things easier. +The report can be one long stream, so tall figures that wouldn't ordinarily fit on +one page can be kept at full size and are easier to read, since the reader can simply +keep scrolling. Additionally, the formatting of an R Markdown document is simple and easy to modify, allowing you to spend +more time on your analyses instead of writing reports. + +## Literate programming + +Ideally, such analysis reports are _reproducible_ documents: If an +error is discovered, or if some additional subjects are added to the +data, you can just re-compile the report and get the new or corrected +results rather than having to reconstruct figures, paste them into +a Word document, and hand-edit various detailed results. + +The key R package here is [`knitr`](https://yihui.name/knitr/). It allows you +to create a document that is a mixture of text and chunks of +code. When the document is processed by `knitr`, chunks of code will +be executed, and graphs or other results will be inserted into the final document. + +This sort of idea has been called "literate programming". + +`knitr` allows you to mix basically any type of text with code from different programming languages, but we recommend that you use `R Markdown`, which mixes Markdown +with R. [Markdown](https://www.markdownguide.org/) is a light-weight mark-up language for creating web +pages. + +## Creating an R Markdown file + +Within RStudio, click File → New File → R Markdown and +you'll get a dialog box like this: + +![](fig/New_R_Markdown.png){alt='Screenshot of the New R Markdown file dialogue box in RStudio'} + +You can stick with the default (HTML output), but give it a title.
+ +## Basic components of R Markdown + +The initial chunk of text (header) contains instructions for R to specify what kind of document will be created, and the options chosen. You can use the header to give your document a title, author, date, and tell it what type of output you want +to produce. In this case, we're creating an html document. + +``` +--- +title: "Initial R Markdown document" +author: "Karl Broman" +date: "April 23, 2015" +output: html_document +--- +``` + +You can delete any of those fields if you don't want them +included. The double-quotes aren't strictly _necessary_ in this case. +They're mostly needed if you want to include a colon in the title. + +RStudio creates the document with some example text to get you +started. Note below that there are chunks like + +
+```{r}
+summary(cars)
+```
+
+ +These are chunks of R code that will be executed by `knitr` and replaced +by their results. More on this later. + +## Markdown + +Markdown is a system for writing web pages by marking up the text much +as you would in an email rather than writing html code. The marked-up +text gets _converted_ to html, replacing the marks with the proper +html code. + +For now, let's delete all of the stuff that's there and write a bit of +markdown. + +You make things **bold** using two asterisks, like this: `**bold**`, +and you make things _italics_ by using underscores, like this: +`_italics_`. + +You can make a bulleted list by writing a list with hyphens or +asterisks with a space between the list and other text, like this: + +``` +A list: + +* bold with double-asterisks +* italics with underscores +* code-type font with backticks +``` + +or like this: + +``` +A second list: + +- bold with double-asterisks +- italics with underscores +- code-type font with backticks +``` + +Each will appear as: + +- bold with double-asterisks +- italics with underscores +- code-type font with backticks + +You can use whatever method you prefer, but _be consistent_. This maintains the +readability of your code. + +You can make a numbered list by just using numbers. You can even use the +same number over and over if you want: + +``` +1. bold with double-asterisks +1. italics with underscores +1. code-type font with backticks +``` + +This will appear as: + +1. bold with double-asterisks +2. italics with underscores +3. code-type font with backticks + +You can make section headers of different sizes by initiating a line +with some number of `#` symbols: + +``` +# Title +## Main section +### Sub-section +#### Sub-sub section +``` + +You _compile_ the R Markdown document to an html webpage by clicking +the "Knit" button in the upper-left. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 1 + +Create a new R Markdown document. 
Delete all of the R code chunks +and write a bit of Markdown (some sections, some italicized +text, and an itemized list). + +Convert the document to a webpage. + +::::::::::::::: solution + +## Solution to Challenge 1 + +In RStudio, select File > New file > R Markdown... + +Delete the placeholder text and add the following: + +``` +# Introduction + +## Background on Data + +This report uses the *gapminder* dataset, which has columns that include: + +* country +* continent +* year +* lifeExp +* pop +* gdpPercap + +## Background on Methods + +``` + +Then click the 'Knit' button on the toolbar to generate an html document (webpage). + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## A bit more Markdown + +You can make a hyperlink like this: +`[Carpentries Home Page](https://carpentries.org/)`. + +You can include an image file like this: `![The Carpentries Logo](https://carpentries.org/assets/img/TheCarpentries.svg)` + +You can do subscripts (e.g., F~2~) with `F~2~` and superscripts (e.g., +F^2^) with `F^2^`. + +If you know how to write equations in +[LaTeX](https://www.latex-project.org/), you can use `$ $` and `$$ $$` to insert math equations, like +`$E = mc^2$` and + +``` +$$y = \mu + \sum_{i=1}^p \beta_i x_i + \epsilon$$ +``` + +You can review Markdown syntax by navigating to the +"Markdown Quick Reference" under the "Help" field in the +toolbar at the top of RStudio. + +## R code chunks + +The real power of Markdown comes from +mixing markdown with chunks of code. This is R Markdown. When +processed, the R code will be executed; if they produce figures, the +figures will be inserted in the final document. + +The main code chunks look like this: + +
+```{r load_data}
+gapminder <- read.csv("gapminder.csv")
+```
+
+ +That is, you place a chunk of R code between \`\`\`{r chunk\_name} +and \`\`\`. You should give each chunk +a unique name, as they will help you to fix errors and, if any graphs are +produced, the file names are based on the name of the code chunk that +produced them. You can create code chunks quickly in RStudio using the shortcuts Ctrl\+Alt\+I on Windows and Linux, or Cmd\+Option\+I on Mac. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 2 + +Add code chunks to: + +- Load the ggplot2 package +- Read the gapminder data +- Create a plot + +::::::::::::::: solution + +## Solution to Challenge 2 + +
+```{r load-ggplot2}
+library("ggplot2")
+```
+
+ +
+```{r read-gapminder-data}
+gapminder <- read.csv("gapminder.csv")
+```
+
+ +
+```{r make-plot}
+plot(lifeExp ~ year, data = gapminder)
+```
+
+ +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## How things get compiled + +When you press the "Knit" button, the R Markdown document is +processed by [`knitr`](https://yihui.name/knitr) and a plain Markdown +document is produced (as well as, potentially, a set of figure files): the R code is executed +and replaced by both the input and the output; if figures are +produced, links to those figures are included. + +The Markdown and figure documents are then processed by the tool +[`pandoc`](https://pandoc.org/), which converts the Markdown file into an +html file, with the figures embedded. + +```{r rmd_to_html_fig, fig.width=8, fig.height=3, fig.align="left", echo=FALSE} +par(mar=rep(0, 4), bty="n", cex=1.5) +plot(0, 0, type="n", xlab="", ylab="", xaxt="n", yaxt="n", + xlim=c(0, 100), ylim=c(0, 100)) +xw <- 10 +yh <- 35 +xm <- 12 +ym <- 50 +rect(xm-xw/2, ym-yh/2, xm+xw/2, ym+yh/2, lwd=2) +text(xm, ym, ".Rmd") + +xm <- 50 +ym <- 80 +rect(xm-xw/2, ym-yh/2, xm+xw/2, ym+yh/2, lwd=2) +text(xm, ym, ".md") +xm <- 50; ym <- 25 +for(i in c(2, 0, -2)) + rect(xm-xw/2+i, ym-yh/2+i, xm+xw/2+i, ym+yh/2+i, lwd=2, + border="black", col="white") +text(xm-2, ym-2, "figs/") + +xm <- 100-12 +ym <- 50 +rect(xm-xw/2, ym-yh/2, xm+xw/2, ym+yh/2, lwd=2) +text(xm, ym, ".html") + +arrows(22, 50, 38, 50, lwd=2, col="slateblue", len=0.1) +text((22+38)/2, 60, "knitr", col="darkslateblue", cex=1.3) + +arrows(62, 50, 78, 50, lwd=2, col="slateblue", len=0.1) +text((62+78)/2, 60, "pandoc", col="darkslateblue", cex=1.3) +``` + +## Chunk options + +There are a variety of options to affect how the code chunks are +treated. Here are some examples: + +- Use `echo=FALSE` to avoid having the code itself shown. +- Use `results="hide"` to avoid having any results printed. +- Use `eval=FALSE` to have the code shown but not evaluated. +- Use `warning=FALSE` and `message=FALSE` to hide any warnings or + messages produced. 
+- Use `fig.height` and `fig.width` to control the size of the figures + produced (in inches). + +So you might write: + +
```{r load_libraries, echo=FALSE, message=FALSE}
+library("dplyr")
+library("ggplot2")
+```
+
+ +Often there will be particular options that you'll want to use +repeatedly; for this, you can set _global_ chunk options, like so: + +
```{r global_options, echo=FALSE}
+knitr::opts_chunk$set(fig.path="Figs/", message=FALSE, warning=FALSE,
+                      echo=FALSE, results="hide", fig.width=11)
+```
+
+ +The `fig.path` option defines where the figures will be saved. The `/` +here is really important; without it, the figures would be saved in +the standard place but just with names that begin with `Figs`. + +If you have multiple R Markdown files in a common directory, you might +want to use `fig.path` to define separate prefixes for the figure file +names, like `fig.path="Figs/cleaning-"` and `fig.path="Figs/analysis-"`. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 3 + +Use chunk options to control the size of a figure and to hide the +code. + +::::::::::::::: solution + +## Solution to Challenge 3 + +
```{r echo = FALSE, fig.width = 3}
+plot(faithful)
+```
+
+ +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +You can review all of the `R` chunk options by navigating to +the "R Markdown Cheat Sheet" under the "Cheatsheets" section +of the "Help" field in the toolbar at the top of RStudio. + +## Inline R code + +You can make _every_ number in your report reproducible. Use \`r and \` for an in-line code chunk, +like so: ` ``r "r round(some_value, 2)"`` `. The code will be +executed and replaced with the _value_ of the result. + +Don't let these in-line chunks get split across lines. + +Perhaps precede the paragraph with a larger code chunk that does +calculations and defines variables, with `include=FALSE` for that larger +chunk (which is the same as `echo=FALSE` and `results="hide"`). + +Rounding can produce differences in output in such situations. You may want +`2.0`, but `round(2.03, 1)` will give just `2`. + +The +[`myround`](https://github.com/kbroman/broman/blob/master/R/myround.R) +function in the [R/broman](https://github.com/kbroman/broman) package handles +this. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 4 + +Try out a bit of in-line R code. + +::::::::::::::: solution + +## Solution to Challenge 4 + +Here's some inline code to determine that 2 + 2 = `r 2+2`. + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Other output options + +You can also convert R Markdown to a PDF or a Word document. Click the +little triangle next to the "Knit" button to get a drop-down +menu. Or you could put `pdf_document` or `word_document` in the initial header +of the file. + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: Creating PDF documents + +Creating .pdf documents may require installation of some extra software. The R +package `tinytex` provides some tools to help make this process easier for R users. 
+With `tinytex` installed, run `tinytex::install_tinytex()` to install the required +software (you'll only need to do this once) and then when you knit to pdf `tinytex` +will automatically detect and install any additional LaTeX packages that are needed to +produce the pdf document. Visit the [tinytex website](https://yihui.org/tinytex/) +for more information. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: Visual markdown editing in RStudio + +RStudio versions 1.4 and later include visual markdown editing mode. +In visual editing mode, markdown expressions (like `**bold words**`) are +transformed to the formatted appearance (**bold words**) as you type. +This mode also includes a toolbar at the top with basic formatting buttons, +similar to what you might see in common word processing software programs. +You can turn visual editing on and off by pressing +the ![](fig/visual_mode_icon.png){alt='Icon for turning on and off the visual editing mode in RStudio, which looks like a pair of compasses'} +button in the top right corner of your R Markdown document. 
+ +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Resources + +- [Knitr in a knutshell tutorial](https://kbroman.org/knitr_knutshell) +- [Dynamic Documents with R and knitr](https://www.amazon.com/exec/obidos/ASIN/1482203537/7210-20) (book) +- [R Markdown documentation](https://rmarkdown.rstudio.com) +- [R Markdown cheat sheet](https://www.rstudio.com/wp-content/uploads/2016/03/rmarkdown-cheatsheet-2.0.pdf) +- [Getting started with R Markdown](https://www.rstudio.com/resources/webinars/getting-started-with-r-markdown/) +- [R Markdown: The Definitive Guide](https://bookdown.org/yihui/rmarkdown/) (book by Rstudio team) +- [Reproducible Reporting](https://www.rstudio.com/resources/webinars/reproducible-reporting/) +- [The Ecosystem of R Markdown](https://www.rstudio.com/resources/webinars/the-ecosystem-of-r-markdown/) +- [Introducing Bookdown](https://www.rstudio.com/resources/webinars/introducing-bookdown/) + +:::::::::::::::::::::::::::::::::::::::: keypoints + +- Mix reporting written in R Markdown with software written in R. +- Specify chunk options to control formatting. +- Use `knitr` to convert these documents into PDF and other formats. + +:::::::::::::::::::::::::::::::::::::::::::::::::: diff --git a/locale/it/episodes/15-wrap-up.Rmd b/locale/it/episodes/15-wrap-up.Rmd new file mode 100644 index 000000000..d9fa5b74f --- /dev/null +++ b/locale/it/episodes/15-wrap-up.Rmd @@ -0,0 +1,110 @@ +--- +title: Writing Good Software +teaching: 15 +exercises: 0 +source: Rmd +--- + +::::::::::::::::::::::::::::::::::::::: objectives + +- Describe best practices for writing R and explain the justification for each. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- How can I write software that other people can use? 
+ +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Structure your project folder + +Keep your project folder structured, organized and tidy, by creating subfolders for your code files, manuals, data, binaries, output plots, etc. It can be done completely manually, or with the help of RStudio's `New Project` functionality, or a designated package, such as `ProjectTemplate`. + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: ProjectTemplate - a possible solution + +One way to automate the management of projects is to install the third-party package, `ProjectTemplate`. +This package will set up an ideal directory structure for project management. +This is very useful as it enables you to have your analysis pipeline/workflow organised and structured. +Together with the default RStudio project functionality and Git you will be able to keep track of your +work as well as be able to share your work with collaborators. + +1. Install `ProjectTemplate`. +2. Load the library +3. Initialise the project: + +```{r, eval=FALSE} +install.packages("ProjectTemplate") +library("ProjectTemplate") +create.project("../my_project_2", merge.strategy = "allow.non.conflict") +``` + +For more information on ProjectTemplate and its functionality visit the +home page [ProjectTemplate](https://projecttemplate.net/index.html) + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Make code readable + +The most important part of writing code is making it readable and understandable. +You want someone else to be able to pick up your code and be able to understand +what it does: more often than not this someone will be you 6 months down the line, +who will otherwise be cursing past-self. + +## Documentation: tell us what and why, not how + +When you first start out, your comments will often describe what a command does, +since you're still learning yourself and it can help to clarify concepts and +remind you later. 
However, these comments aren't particularly useful later on +when you don't remember what problem your code is trying to solve. Try to also +include comments that tell you _why_ you're solving a problem, and _what_ problem +that is. The _how_ can come after that: it's an implementation detail you ideally +shouldn't have to worry about. + +## Keep your code modular + +Our recommendation is that you should separate your functions from your analysis +scripts, and store them in a separate file that you `source` when you open the R +session in your project. This approach is nice because it leaves you with an +uncluttered analysis script, and a repository of useful functions that can be +loaded into any analysis script in your project. It also lets you group related +functions together easily. + +## Break down problem into bite size pieces + +When you first start out, problem solving and function writing can be daunting +tasks, and hard to separate from code inexperience. Try to break down your +problem into digestible chunks and worry about the implementation details later: +keep breaking down the problem into smaller and smaller functions until you +reach a point where you can code a solution, and build back up from there. + +## Know that your code is doing the right thing + +Make sure to test your functions! + +## Don't repeat yourself + +Functions enable easy reuse within a project. If you see blocks of similar +lines of code through your project, those are usually candidates for being +moved into functions. + +If your calculations are performed through a series of functions, then the +project becomes more modular and easier to change. This is especially the case +for which a particular input always gives a particular output. + +## Remember to be stylish + +Apply consistent style to your code. + +:::::::::::::::::::::::::::::::::::::::: keypoints + +- Keep your project folder structured, organized and tidy. +- Document what and why, not how. 
+- Break programs into short single-purpose functions. +- Write re-runnable tests. +- Don't repeat yourself. +- Be consistent in naming, indentation, and other aspects of style. + +:::::::::::::::::::::::::::::::::::::::::::::::::: diff --git a/locale/it/index.md b/locale/it/index.md new file mode 100644 index 000000000..c434e5efa --- /dev/null +++ b/locale/it/index.md @@ -0,0 +1,34 @@ +--- +site: sandpaper::sandpaper_site +--- + +_an introduction to R for non-programmers using gapminder data_ + +The goal of this lesson is to teach novice programmers to write modular code +and best practices for using R for data analysis. R is commonly used in many +scientific disciplines for statistical analysis and its array of third-party +packages. We find that many scientists who come to Software Carpentry workshops +use R and want to learn more. The emphasis of these materials is to give +attendees a strong foundation in the fundamentals of R, and to teach best +practices for scientific computing: breaking down analyses into modular units, +task automation, and encapsulation. + +Note that this workshop will focus on teaching the fundamentals of the +programming language R, and will not teach statistical analysis. + +The lesson contains more material than can be taught in a day. The [instructor notes page](instructors/instructor-notes.md) has some suggested lesson plans suitable for a one or half day workshop. + +A variety of third party packages are used throughout this workshop. These +are not necessarily the best, nor are they comprehensive, but they are +packages we find useful, and have been chosen primarily for their +usability. + +:::::::::::::::::::::::::::::::::::::::::: prereq + +## Prerequisites + +Understand that computers store data and instructions (programs, scripts etc.) in files. +Files are organised in directories (folders). +Know how to access files not in the working directory by specifying the path. 
+ +:::::::::::::::::::::::::::::::::::::::::::::::::: diff --git a/locale/it/instructors/instructor-notes.md b/locale/it/instructors/instructor-notes.md new file mode 100644 index 000000000..43ffc4c20 --- /dev/null +++ b/locale/it/instructors/instructor-notes.md @@ -0,0 +1,132 @@ +--- +title: Instructor Notes +--- + +## Timing + +Leave about 30 minutes at the start of each workshop and another 15 mins +at the start of each session for technical difficulties like WiFi and +installing things (even if you asked students to install in advance, longer if +not). + +## Lesson Plans + +The lesson contains much more material than can be taught in a day. +Instructors will need to pick an appropriate subset of episodes to use +in a standard one day course. + +Some suggested paths through the material are: + +(suggested by [@liz-is](https://github.com/swcarpentry/r-novice-gapminder/issues/104#issuecomment-276529213)) + +- 01 Introduction to R and RStudio +- 04 Data Structures +- 05 Exploring Data Frames ("Realistic example" section onwards) +- 08 Creating Publication-Quality Graphics with ggplot2 +- 10 Functions Explained +- 13 Dataframe Manipulation with dplyr +- 15 Producing Reports With knitr + +(suggested by [@naupaka](https://github.com/swcarpentry/r-novice-gapminder/issues/104#issuecomment-312547509)) + +- 01 Introduction to R and RStudio +- 02 Project Management With RStudio +- 03 Seeking Help +- 04 Data Structures +- 05 Exploring Data Frames +- 06 Subsetting Data +- 09 Vectorization +- 08 Creating Publication-Quality Graphics with ggplot2 _OR_ + 13 Dataframe Manipulation with dplyr +- 15 Producing Reports With knitr + +A half day course could consist of (suggested by [@karawoo](https://github.com/swcarpentry/r-novice-gapminder/issues/104#issuecomment-277599864)): + +- 01 Introduction to R and RStudio +- 04 Data Structures (only creating vectors with `c()`) +- 05 Exploring Data Frames ("Realistic example" section onwards) +- 06 Subsetting Data (excluding factor, matrix 
and list subsetting) +- 08 Creating Publication-Quality Graphics with ggplot2 + +## Setting up git in RStudio + +There can be difficulties linking git to RStudio depending on the +operating system and the version of the operating system. To make sure +Git is properly installed and configured, the learners should go to +the Options window in the RStudio application. + +- **Mac OS X:** + - Go RStudio -> Preferences... -> Git/SVN + - Check and see whether there is a path to a file in the "Git executable" window. If not, the next challenge is figuring out where Git is located. + - In the terminal enter `which git` and you will get a path to the git executable. In the "Git executable" window you may have difficulties finding the directory since OS X hides many of the operating system files. While the file selection window is open, pressing "Command-Shift-G" will pop up a text entry box where you will be able to type or paste in the full path to your git executable: e.g. /usr/bin/git or whatever else it might be. +- **Windows:** + - Go Tools -> Global options... -> Git/SVN + - If you use the Software Carpentry Installer, then 'git.exe' should be installed at `C:/Program Files/Git/bin/git.exe`. + +To prevent the learners from having to re-enter their password each time they push a commit to GitHub, this command (which can be run from a bash prompt) will make it so they only have to enter their password once: + +```bash +$ git config --global credential.helper 'cache --timeout=10000000' +``` + +## RStudio Color Preview + +RStudio has a feature to preview the color for certain named colors and hexadecimal colors. This may confuse or distract learners (and instructors) who are not expecting it. 
+ +Mainly, this is likely to come up during the episode on "Data Structures" with the following code block: + +```r +cats <- data.frame(coat = c("calico", "black", "tabby"), + weight = c(2.1, 5.0, 3.2), + likes_string = c(1, 0, 1)) +``` + +This option can be turned off and on in the following menu setting: +Tools -> Global Options -> Code -> Display -> Enable preview of named and hexadecimal colors (under "Syntax") + +## Pulling in Data + +The easiest way to get the data used in this lesson during a workshop is to have +attendees download the raw data from [gapminder-data] and +[gapminder-data-wide]. + +Attendees can use the `File - Save As` dialog in their browser to save the file. + +## Overall + +Make sure to emphasize good practices: put code in scripts, and make +sure they're version controlled. Encourage students to create script +files for challenges. + +If you're working in a cloud environment, get them to upload the +gapminder data after the second lesson. + +Make sure to emphasize that matrices are vectors underneath the hood +and data frames are lists underneath the hood: this will explain a +lot of the esoteric behaviour encountered in basic operations. + +Vector recycling and function stacks are probably best explained +with diagrams on a whiteboard. + +Be sure to actually go through examples of an R help page: help files +can be intimidating at first, but knowing how to read them is tremendously +useful. + +Be sure to show the CRAN task views, look at one of the topics. + +There's a lot of content: move quickly through the earlier lessons. Their +extensiveness is mostly for purposes of learning by osmosis: so that their +memory will trigger later when they encounter a problem or some esoteric behaviour. + +Key lessons to take time on: + +- Data subsetting - conceptually difficult for novices +- Functions - learners especially struggle with this +- Data structures - worth being thorough, but you can go through it quickly. 
+ +Don't worry about being correct or knowing the material back-to-front. Use +mistakes as teaching moments: the most vital skill you can impart is how to +debug and recover from unexpected errors. + +[gapminder-data]: data/gapminder_data.csv +[gapminder-data-wide]: data/gapminder_wide.csv diff --git a/locale/it/learners/discuss.md b/locale/it/learners/discuss.md new file mode 100644 index 000000000..0605730b1 --- /dev/null +++ b/locale/it/learners/discuss.md @@ -0,0 +1,7 @@ +--- +title: Discussion +--- + +Please see [our other R lesson][r-gap] for a different presentation of these concepts. + +[r-gap]: https://swcarpentry.github.io/r-novice-gapminder/ diff --git a/locale/it/learners/reference.md b/locale/it/learners/reference.md new file mode 100644 index 000000000..a4c31f8db --- /dev/null +++ b/locale/it/learners/reference.md @@ -0,0 +1,342 @@ +--- +title: Reference +--- + +## Reference + +## [Introduction to R and RStudio](episodes/01-rstudio-intro.Rmd) + +- Use the escape key to cancel incomplete commands or running code + (Ctrl+C) if you're using R from the shell. +- Basic arithmetic operations follow standard order of precedence: + - Brackets: `(`, `)` + - Exponents: `^` or `**` + - Divide: `/` + - Multiply: `*` + - Add: `+` + - Subtract: `-` +- Scientific notation is available, e.g: `2e-3` +- Anything to the right of a `#` is a comment, R will ignore this! +- Functions are denoted by `function_name()`. Expressions inside the + brackets are evaluated before being passed to the function, and + functions can be nested. +- Mathematical functions: `exp`, `sin`, `log`, `log10`, `log2` etc. +- Comparison operators: `<`, `<=`, `>`, `>=`, `==`, `!=` +- Use `all.equal` to compare numbers! +- `<-` is the assignment operator. Anything to the right is evaluate, then + stored in a variable named to the left. +- `ls` lists all variables and functions you've created +- `rm` can be used to remove them +- When assigning values to function arguments, you _must_ use `=`. 
+ +## [Project management with RStudio](episodes/02-project-intro.Rmd) + +- To create a new project, go to File -> New Project +- Install the `packrat` package to create self-contained projects +- `install.packages` to install packages from CRAN +- `library` to load a package into R +- `packrat::status` to check whether all packages referenced in your + scripts have been installed. + +## [Seeking help](episodes/03-seeking-help.Rmd) + +- To access help for a function, type `?function_name` or `help(function_name)` +- Use quotes for special operators, e.g. `?"+"` +- Use fuzzy search if you can't remember a name: `??search_term` +- [CRAN task views](https://cran.at.r-project.org/web/views) are a good starting point. +- [Stack Overflow](https://stackoverflow.com/) is a good place to get help with your code. + - `?dput` will dump data you are working from so others can load it easily. + - `sessionInfo()` will give details of your setup that others may need for debugging. + +## [Data structures](episodes/04-data-structures-part1.Rmd) + +Individual values in R must be one of 5 **data types**; multiple values can be grouped in **data structures**. + +**Data types** + +- `typeof(object)` gives information about an item's data type. + +- There are 5 main data types: + + - `?numeric` real (decimal) numbers + - `?integer` whole numbers only + - `?character` text + - `?complex` complex numbers + - `?logical` TRUE or FALSE values + + **Special types:** + + - `?NA` missing values + - `?NaN` "not a number" for undefined values (e.g. `0/0`). + - `?Inf`, `-Inf` infinity. + - `?NULL` a data structure that doesn't exist + + `NA` can occur in any atomic vector. `NaN` and `Inf` can only + occur in numeric (double) or complex type vectors; an integer vector cannot hold them. Atomic vectors + are the building blocks for all other data structures. A `NULL` value + will occur in place of an entire data structure (but can occur as list + elements). 
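
The data types and special values listed above can be checked interactively. The following is a small sketch under the assumption that you are in a fresh R session; the names `values`, `types`, and `special` are illustrative only.

```r
# One example value for each of the five main data types
values <- list(3.14, 2L, "text", 1+4i, TRUE)

# typeof() reports the underlying type of each value
# ("numeric" values are stored as type "double")
types <- vapply(values, typeof, character(1))
print(types)

# Special values
special <- c(1, NA, 0 / 0, 1 / 0)
print(is.na(special))       # TRUE for NA and also for NaN
print(is.nan(special))      # TRUE only for NaN
print(is.finite(special))   # FALSE for NA, NaN, and Inf
print(is.null(NULL))        # NULL stands in for a missing structure
```

Note that `is.na` also returns `TRUE` for `NaN`, so use `is.nan` when the two cases need to be told apart.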
+ +**Basic data structures in R:** + +- atomic `?vector` (can only contain one type) +- `?list` (containers for other objects) +- `?data.frame` two dimensional objects whose columns can contain different types of data +- `?matrix` two dimensional objects that can contain only one type of data. +- `?factor` vectors that contain predefined categorical data. +- `?array` multi-dimensional objects that can only contain one type of data + +Remember that matrices are really atomic vectors underneath the hood, and that +data.frames are really lists underneath the hood (this explains some of the weirder +behaviour of R). + +**[Vectors](episodes/04-data-structures-part1.Rmd)** + +- `?vector()` All items in a vector must be the same type. +- Items can be converted from one type to another using _coercion_. +- The concatenate function 'c()' will append items to a vector. +- `seq(from=0, to=1, by=1)` will create a sequence of numbers. +- Items in a vector can be named using the `names()` function. + +**[Factors](episodes/04-data-structures-part1.Rmd)** + +- `?factor()` Factors are a data structure designed to store categorical data. +- `levels()` shows the valid values that can be stored in a vector of type factor. + +**[Lists](episodes/04-data-structures-part1.Rmd)** + +- `?list()` Lists are a data structure designed to store data of different types. + +**[Matrices](episodes/04-data-structures-part1.Rmd)** + +- `?matrix()` Matrices are a data structure designed to store 2-dimensional data. + +**[Data Frames](episodes/05-data-structures-part2.Rmd)** + +- `?data.frame` is a key data structure. It is a `list` of `vectors`. +- `cbind()` will add a column (vector) to a data.frame. +- `rbind()` will add a row (list) to a data.frame. + +**Useful functions for querying data structures:** + +- `?str` structure, prints out a summary of the whole data structure +- `?typeof` tells you the type inside an atomic vector +- `?class` what is the data structure? 
+- `?head` print the first `n` elements (rows for two-dimensional objects) +- `?tail` print the last `n` elements (rows for two-dimensional objects) +- `?rownames`, `?colnames`, `?dimnames` retrieve or modify the row names + and column names of an object. +- `?names` retrieve or modify the names of an atomic vector or list (or + columns of a data.frame). +- `?length` get the number of elements in an atomic vector +- `?nrow`, `?ncol`, `?dim` get the dimensions of an n-dimensional object + (won't work on atomic vectors or lists). + +## [Exploring Data Frames](episodes/05-data-structures-part2.Rmd) + +- `read.csv` to read in data in a regular structure + - `sep` argument to specify the separator + - "," for comma separated + - "\\t" for tab separated + - Other arguments: + - `header=TRUE` if there is a header row + +## [Subsetting data](episodes/06-data-subsetting.Rmd) + +- Elements can be accessed by: + + - Index + - Name + - Logical vectors + +- `[` single square brackets: + + - _extract_ single elements or _subset_ vectors + - e.g. `x[1]` extracts the first item from vector x. + - _extract_ single elements of a list. The returned value will be another `list()`. + - _extract_ columns from a data.frame + +- `[` with two arguments to: + + - _extract_ rows and/or columns of + - matrices + - data.frames + - e.g. `x[1,2]` will extract the value in row 1, column 2. + - e.g. `x[,2]` will extract the entire second column of values. + +- `[[` double square brackets to extract items from lists. + +- `$` to access columns or list elements by name + +- negative indices skip elements + +## [Control flow](episodes/07-control-flow.Rmd) + +- Use `if` condition to start a conditional statement, `else if` condition to provide + additional tests, and `else` to provide a default +- Enclose the bodies of the branches of conditional statements in curly braces `{ }`; indentation is for readability only. +- Use `==` to test for equality. +- `%in%` will return a `TRUE`/`FALSE` indicating if there is a match between an element and a vector. 
+- `X && Y` is only true if both X and Y are `TRUE`. +- `X || Y` is true if either X or Y, or both, are `TRUE`. +- Zero is considered `FALSE`; all other numbers are considered `TRUE` +- Nest loops to operate on multi-dimensional data. + +## [Creating publication quality graphics](episodes/08-plot-ggplot2.Rmd) + +- figures can be created with the grammar of graphics: + - `library(ggplot2)` + - `ggplot` to create the base figure + - `aes`thetics specify the data axes, shape, color, and data size + - `geom`etry functions specify the type of plot, e.g. `point`, `line`, `density`, `box` + - `geom`etry functions also add statistical transforms, e.g. `geom_smooth` + - `scale` functions change the mapping from data to aesthetics + - `facet` functions stratify the figure into panels + - `aes`thetics apply to individual layers, or can be set for the whole plot + inside `ggplot`. + - `theme` functions change the overall look of the plot + - order of layers matters! + - `ggsave` to save a figure. + +## [Vectorization](episodes/09-vectorization.Rmd) + +- Most functions and operations apply to each element of a vector +- `*` applies element-wise to matrices +- `%*%` for true matrix multiplication +- `any()` will return `TRUE` if any element of a vector is `TRUE` +- `all()` will return `TRUE` if _all_ elements of a vector are `TRUE` + +## [Functions explained](episodes/10-functions.Rmd) + +- `?"function"` +- Put code whose parameters change frequently in a function, then call it with + different parameter values to customize its behavior. +- The last line of a function is returned, or you can use `return` explicitly +- Any code written in the body of the function will preferably look for variables defined inside the function. 
+- Document Why, then What, then lastly How (if the code isn't self-explanatory) + +## [Writing data](episodes/11-writing-data.Rmd) + +- `write.table` to write out objects in regular format +- set `quote=FALSE` so that text isn't wrapped in `"` marks + +## [Dataframe manipulation with dplyr](episodes/12-dplyr.Rmd) + +- `library(dplyr)` +- `?select` to extract variables by name. +- `?filter` return rows with matching conditions. +- `?group_by` group data by one or more variables. +- `?summarize` summarize multiple values to a single value. +- `?mutate` add new variables to a data.frame. +- Combine operations using the `?"%>%"` pipe operator. + +## [Dataframe manipulation with tidyr](episodes/13-tidyr.Rmd) + +- `library(tidyr)` +- `?pivot_longer` convert data from _wide_ to _long_ format. +- `?pivot_wider` convert data from _long_ to _wide_ format. +- `?separate` split a single value into multiple values. +- `?unite` merge multiple values into a single value. + +## [Producing reports with knitr](episodes/14-knitr-markdown.Rmd) + +- Value of reproducible reports +- Basics of Markdown +- R code chunks +- Chunk options +- Inline R code +- Other output formats + +## [Best practices for writing good code](episodes/15-wrap-up.Rmd) + +- Program defensively, i.e., assume that errors are going to arise, and write code to detect them when they do. +- Write tests before writing code in order to help determine exactly what that code is supposed to do. +- Know what code is supposed to do before trying to debug it. +- Make it fail every time. +- Make it fail fast. +- Change one thing at a time, and for a reason. +- Keep track of what you've done. +- Be humble + +## Glossary + +[argument]{#argument} +: A value given to a function or program when it runs. +The term is often used interchangeably (and inconsistently) with [parameter](#parameter). + +[assign]{#assign} +: To give a value a name by associating a variable with it. 
+ +[body]{#body} +: (of a function): the statements that are executed when a function runs. + +[comment]{#comment} +: A remark in a program that is intended to help human readers understand what is going on, +but is ignored by the computer. +Comments in Python, R, and the Unix shell start with a `#` character and run to the end of the line; +comments in SQL start with `--`, +and other languages have other conventions. + +[comma-separated values]{#comma-separated-values} +: (CSV) A common textual representation for tables +in which the values in each row are separated by commas. + +[delimiter]{#delimiter} +: A character or characters used to separate individual values, +such as the commas between columns in a [CSV](#comma-separated-values) file. + +[documentation]{#documentation} +: Human-language text written to explain what software does, +how it works, or how to use it. + +[floating-point number]{#floating-point-number} +: A number containing a fractional part and an exponent. +See also: [integer](#integer). + +[for loop]{#for-loop} +: A loop that is executed once for each value in some kind of set, list, or range. +See also: [while loop](#while-loop). + +[index]{#index} +: A subscript that specifies the location of a single value in a collection, +such as a single pixel in an image. + +[integer]{#integer} +: A whole number, such as -12343. See also: [floating-point number](#floating-point-number). + +[library]{#library} +: In R, the directory(ies) where [packages](#package) are stored. + +[package]{#package} +: A collection of R functions, data and compiled code in a well-defined format. Packages are stored in a [library](#library) and loaded using the library() function. + +[parameter]{#parameter} +: A variable named in the function's declaration that is used to hold a value passed into the call. +The term is often used interchangeably (and inconsistently) with [argument](#argument). 
+ +[return statement]{#return-statement} +: A statement that causes a function to stop executing and return a value to its caller immediately. + +[sequence]{#sequence} +: A collection of information that is presented in a specific order. + +[shape]{#shape} +: An array's dimensions, represented as a vector. +For example, a 5×3 array's shape is `(5,3)`. + +[string]{#string} +: Short for "character string", +a [sequence](#sequence) of zero or more characters. + +[syntax error]{#syntax-error} +: A programming error that occurs when statements are in an order or contain characters +not expected by the programming language. + +[type]{#type} +: The classification of something in a program (for example, the contents of a variable) +as a kind of number (e.g. [floating-point number](#floating-point-number), [integer](#integer)), [string](#string), +or something else. In R the command typeof() is used to query a variables type. + +[while loop]{#while-loop} +: A loop that keeps executing as long as some condition is true. +See also: [for loop](#for-loop). diff --git a/locale/it/learners/setup.md b/locale/it/learners/setup.md new file mode 100644 index 000000000..736e10764 --- /dev/null +++ b/locale/it/learners/setup.md @@ -0,0 +1,8 @@ +--- +title: Setup +--- + +This lesson assumes you have R and RStudio installed on your computer. + +- [Download and install the latest version of R](https://www.r-project.org/). +- [Download and install RStudio](https://www.rstudio.com/products/rstudio/download/#download). RStudio is an application (an integrated development environment or IDE) that facilitates the use of R and offers a number of nice additional features. You will need the free Desktop version for your computer. diff --git a/locale/it/profiles/learner-profiles.md b/locale/it/profiles/learner-profiles.md new file mode 100644 index 000000000..75b2c5cad --- /dev/null +++ b/locale/it/profiles/learner-profiles.md @@ -0,0 +1,5 @@ +--- +title: FIXME +--- + +This is a placeholder file. 
Please add content here. diff --git a/locale/ja/CODE_OF_CONDUCT.md b/locale/ja/CODE_OF_CONDUCT.md new file mode 100644 index 000000000..72e962472 --- /dev/null +++ b/locale/ja/CODE_OF_CONDUCT.md @@ -0,0 +1,10 @@ +--- +title: Contributor Code of Conduct +--- + +プロジェクト貢献者・保持者は、 [カーペントリーの行動規範](https://carpentries-coc.readthedocs.io/ja/latest/topic_folders/policies/code-of-conduct.html)に基づき行動することを誓います。 + +嫌がらせ、ハラスメント行為、その他の悪意ある行動・言動は、 カーペントリーの[報告ガイドライン](https://carpentries-coc.readthedocs.io/ja/latest/topic_folders/policies/incident-reporting.html)に沿って報告させていただきます。 + +[coc-reporting]: https://docs.carpentries.org/topic_folders/policies/incident-reporting.html +[coc]: https://docs.carpentries.org/topic_folders/policies/code-of-conduct.html diff --git a/locale/ja/CONTRIBUTING.md b/locale/ja/CONTRIBUTING.md new file mode 100644 index 000000000..684050340 --- /dev/null +++ b/locale/ja/CONTRIBUTING.md @@ -0,0 +1,94 @@ +## プロジェクトに貢献する + +[SoftwareCarpentry][swc-site]と[DataCarpentry][dc-site]はオープンソースのプロジェクトです。 コミュニティーからの資料提供・ご協力、例えば、 新しいレッスン、 既存の資料の修正、 バグレポート、 変更点のレビューなど、どんなに些細な変更も歓迎いたします。 + +### 貢献する方法 + +このプロジェクトに貢献することにより、 自身が提供したコンテンツを[私達のライセンス](License.md)に基づき配布する事に同意するものとします。 ご協力と引き換えに、 私達はあなたが提供する変更点・問題点などを検討し、 できるだけ早くコミュニティーの一員になれるよう尽力いたします。 Everyone involved in [The Carpentries][cp-site] agrees to abide by +our [code of conduct](CODE_OF_CONDUCT.md). + +### 貢献する方法 + +一番簡単に貢献する方法は、 誤字、言葉遣い、 間違った内容などを issue(イシュー)で報告する事です。 Issueを報告することによって、自分をコミュニティーに紹介し、 また、コミュニティーのメンバーと出会う良い機会にもなります。 + +1. If you do not have a [GitHub][github] account, you can [send us comments by + email][contact]. ですが、 以下の方法であればメールよりも早急に対応できる場合がありますので、そちらをお勧め致します。 + +2. またはアカウントを[新たに作る][github-join]気がある方で、 あまりGitに詳しくない・使い慣れていない方は、 質問・提案などを\[新しいイシュー]\[new-issue]として開いて下さい。 イシューを開くことによって、コミュニティーから誰かをそのイシューに割り当て、 スレッド化したディスカッションとして質問・提案に応答させていただくことができます。 + +3. If you are comfortable with Git, and would like to add or change material, + you can submit a pull request (PR). 
プルリクエストを使った提出方法は、[下記に記載されています](#using-github)。 For inspiration about changes that need to + be made, check out the [list of open issues][issues] across the Carpentries. + +Note: if you want to build the website locally, please refer to [The Workbench +documentation][template-doc]. + +### どこへ貢献するか + +1. If you wish to change this lesson, add issues and pull requests here. +2. If you wish to change the template used for workshop websites, please refer + to [The Workbench documentation][template-doc]. + +### 貢献していただきたい個所 + +新しい例を書く、すでにある例の改善、 ドキュメントのアップデート、 不明瞭な点、欠点、「動作に不具合がある」といった \[バグの報告]\[new-issue]など、 様々な方法で貢献していただくことができます。 +どういったイシューを開いたら良いかわからない場合は、 [このリポジトリのイシュー][issues]、 [Data Carpentryのイシュー][dc-issues]、 もしくは[Software Carpentryのイシュー][swc-issues]を見てみて下さい。 + +Comments on issues and reviews of pull requests are just as welcome: we are +smarter together than we are on our own. すでにあるイシューへのコメントや、プルリクエストのレビューなども歓迎いたします。 皆さんで協力したほうが、良い結果につながります。 また、新しく加入された方の意見やレビューなどは特に重要視しています。 レッスンの資料を幾度となく見てきた方は特に見落としがちなのですが、 私達が提供している資料・コンテンツは、初めて資料を見る方などには、理解するのに時間が掛かる場合があるので、 通常とは違う視点からの意見は大変貴重なのです。 + +### 貢献していただきたくない個所 + +Our lessons already contain more material than we can cover in a typical +workshop, so we are usually _not_ looking for more concepts or tools to add to +them. As a rule, if you want to introduce a new idea, you must (a) estimate how +long it will take to teach and (b) explain what you would take out to make room +for it. The first encourages contributors to be honest about requirements; the +second, to think hard about priorities. + +We are also not looking for exercises or other material that only run on one +platform. Our workshops typically contain a mixture of Windows, macOS, and +Linux users; in order to be usable, our lessons must run equally well on all +three. + +### GitHubの使い方 + +GitHubから資料を提供したい場合は、 [GitHubでオープンソース・プロジェクトに貢献する方法][how-contribute] を参照して下さい。 私達は[GitHub flow][github-flow]を使って変更点などを管理しています: + +1. 
自身が持っているオリジナルのリポジトリのコピー(フォーク)に新しいブランチを作り、そのブランチで内容を変更します。 +2. 作ったブランチ内で変更点をコミットします。 +3. そのブランチをGitHubのフォークにプッシュします。 +4. 自身のフォークからオリジナルの[リポジトリ][repo]へプルリクエストを提出します。 +5. 頂いたコメントやレビューからの提案で、更に内容を変更する場合は、 自分のブランチで内容を変更し、GitHubのフォークにプッシュして下さい: 自動的にプルリクエストの内容がアップデートされます。 + +NB: The published copy of the lesson is usually in the `main` branch. + +全てのレッスンには二人のメインテイナーがおり、彼・彼女らがイシューやプルリクエストを管理・見直す、 もしくはその他のメンバーに、一緒に見直すように声をかけます。 メインテイナー達はコミュニティーのボランティアですので、 最終的に何を変更するかの決定権は、メインテイナーに委ねられています。 + +### その他の資料 + +The Carpentries is a global organisation with volunteers and learners all over +the world. We share values of inclusivity and a passion for sharing knowledge, +teaching and learning. There are several ways to connect with The Carpentries +community listed at \ including via social +media, slack, newsletters, and email lists. また、[メール][contact]からでもご連絡いただけます。 + +[repo]: https://github.com/swcarpentry/r-novice-gapminder +[repo-issues]: https://github.com/swcarpentry/r-novice-gapminder/issues +[contact]: mailto:team@carpentries.org +[cp-site]: https://carpentries.org/ +[dc-issues]: https://github.com/issues?q=user%3Adatacarpentry +[dc-lessons]: https://datacarpentry.org/lessons/ +[dc-site]: https://datacarpentry.org/ +[discuss-list]: https://lists.software-carpentry.org/listinfo/discuss +[github]: https://github.com +[github-flow]: https://guides.github.com/introduction/flow/ +[github-join]: https://github.com/join +[how-contribute]: https://egghead.io/courses/how-to-contribute-to-an-open-source-project-on-github +[issues]: https://carpentries.org/help-wanted-issues/ +[lc-issues]: https://github.com/issues?q=user%3ALibraryCarpentry +[swc-issues]: https://github.com/issues?q=user%3Aswcarpentry +[swc-lessons]: https://software-carpentry.org/lessons/ +[swc-site]: https://software-carpentry.org/ +[lc-site]: https://librarycarpentry.org/ +[template-doc]: https://carpentries.github.io/workbench/ diff --git a/locale/ja/LICENSE.md b/locale/ja/LICENSE.md new file mode 100644 index 
000000000..b7d1ea952 --- /dev/null +++ b/locale/ja/LICENSE.md @@ -0,0 +1,55 @@ +--- +title: Licenses +--- + +## 教材 + +All Carpentries (Software Carpentry, Data Carpentry, and Library Carpentry) +instructional material is made available under the [Creative Commons +Attribution license][cc-by-human]. The following is a human-readable summary of +(and not a substitute for) the [full legal text of the CC BY 4.0 +license][cc-by-legal]. + +あなたは以下の条件に従う限り、自由に: + +- 共有 ---どのようなメディアやフォーマットでも資料を複製したり、再配布できます +- 翻案 ---マテリアルをリミックスしたり、改変したり、別の作品のベースにしたりできます + +営利目的も含め、どのような目的でも。 + +あなたがライセンスの条件に従っている限り、許諾者がこれらの自由を 取り消すことはできません。 + +あなたの従うべき条件は以下の通りです: + +- ソフトウェアカーペントリーの著作物 (Software Carpentry©) から派生していることを記載して、 そして適切な場合は[https://carpentries.org](http://software-carpentry.org)へのリンクを表示), [ライセンスへのリンク](https://creativecommons.org/licenses/by/4.0/deed.ja)を 提供し、変更があったらその旨を示さなければなりません。 これらは合理的であればどのような方法で行っても構いませんが、 許諾者があなたやあなたの利用行為を支持していると示唆するような方法は除きます。 + +- 追加的な制約は課せません ---あなたは、このライセンスが他の者に 許諾することを法的に制限するようないかなる法的規定も技術的手段も 適用してはなりません: With the understanding that: + +ご注意: + +- You do not have to comply with the license for elements of the material in + the public domain or where your use is permitted by an applicable exception + or limitation. +- No warranties are given. The license may not give you all of the permissions + necessary for your intended use. For example, other rights such as publicity, + privacy, or moral rights may limit how you use the material. 
+ +## ソフトウェア + +特に記載がある場合を除いて、ソフトウェアカーペントリーおよびデータカーペントリーが 提供しているサンプルプログラムやソフトウェアは、 [OSI][osi]が承認した [MITライセンス](https://ja.osdn.net/projects/opensource/wiki/licenses%2FMIT_license)の下で利用可能です。 + +以下に定める条件に従い、本ソフトウェアおよび関連文書のファイル (以下「ソフトウェア」)の複製を取得するすべての人に対し、ソフトウェアを 無制限に扱うことを無償で許可します。これには、ソフトウェアの複製を使用、 複写、変更、結合、掲載、頒布、サブライセンス、および/または販売する権利、 およびソフトウェアを提供する相手に同じことを許可する権利も無制限に含まれます。 + +上記の著作権表示および本許諾表示を、ソフトウェアのすべての複製または 重要な部分に記載するものとします。 + +ソフトウェアは「現状のまま」で、明示であるか暗黙であるかを問わず、 何らの保証もなく提供されます。ここでいう保証とは、商品性、 特定の目的への適合性、および権利非侵害についての保証も含みますが、 それに限定されるものではありません。 作者または著作権者は、契約行為、 不法行為、またはそれ以外であろうと、ソフトウェアに起因または関連し、 あるいはソフトウェアの使用またはその他の扱いによって生じる一切の請求、 損害、その他の義務について何らの責任も負わないものとします。 + +## 商標 + +「Software Carpentry」と「Data Carpentry」およびそれぞれのロゴは[Community Initiatives][CI]における登録商標または商標です。 + +[cc-by-human]: https://creativecommons.org/licenses/by/4.0/ +[cc-by-legal]: https://creativecommons.org/licenses/by/4.0/legalcode +[mit-license]: https://opensource.org/licenses/mit-license.html +[ci]: https://communityin.org/ +[osi]: https://opensource.org diff --git a/locale/ja/README.md b/locale/ja/README.md new file mode 100644 index 000000000..f6d1289a7 --- /dev/null +++ b/locale/ja/README.md @@ -0,0 +1,6 @@ +# Internationalisation hub repository for Software Carpentry R for Reproducible Scientific Analysis + +プログラマーでない人のための gapminder データを用いた R 入門。 +Please see [https://swcarpentry.github.io/r-novice-gapminder](https://swcarpentry.github.io/r-novice-gapminder) for a rendered version of this material in English. + +More info to follow. diff --git a/locale/ja/config.yaml b/locale/ja/config.yaml new file mode 100644 index 000000000..fb1e3c895 --- /dev/null +++ b/locale/ja/config.yaml @@ -0,0 +1,71 @@ +#------------------------------------------------------------ +#Values for this lesson. +#------------------------------------------------------------ +#Which carpentry is this (swc, dc, lc, or cp)?
+#swc: Software Carpentry +#dc: Data Carpentry +#lc: Library Carpentry +#cp: Carpentries (to use for instructor training for instance) +#incubator: The Carpentries Incubator +carpentry: 'swc' +#Overall title for pages. +title: '再現可能な科学分析のためのR' +#Date the lesson was created (YYYY-MM-DD, this is empty by default) +created: '2015-04-18' +#Comma-separated list of keywords for the lesson +keywords: 'ソフトウェア, データ, レッスン, カーペントリーズ' +#Life cycle stage of the lesson +#possible values: pre-alpha, alpha, beta, stable +life_cycle: 'stable' +#License of the lesson materials (recommended CC-BY 4.0) +license: 'CC-BY 4.0' +#Link to the source repository for this lesson +source: 'https://github.com/swcarpentry-ja/r-novice-gapminder' +#Default branch of your lesson +branch: 'main' +#Who to contact if there are any issues +contact: 'team@carpentries.org' +#Navigation ------------------------------------------------ +#Use the following menu items to specify the order of +#individual pages in each dropdown section. Leave blank to +#include all pages in the folder. +#Example ------------- +#episodes: +#- introduction.md +#- first-steps.md +#learners: +#- setup.md +#instructors: +#- instructor-notes.md +#profiles: +#- one-learner.md +#- another-learner.md +#Order of episodes in your lesson +episodes: + - 01-rstudio-intro.Rmd + - 02-project-intro.Rmd + - 03-seeking-help.Rmd + - 04-data-structures-part1.Rmd + - 05-data-structures-part2.Rmd + - 06-data-subsetting.Rmd + - 07-control-flow.Rmd + - 08-plot-ggplot2.Rmd + - 09-vectorization.Rmd + - 10-functions.Rmd + - 11-writing-data.Rmd + - 12-dplyr.Rmd + - 13-tidyr.Rmd + - 14-knitr-markdown.Rmd + - 15-wrap-up.Rmd +#Information for Learners +learners: +#Information for Instructors +instructors: +#Learner Profiles +profiles: +#Customisation --------------------------------------------- +#This space below is where custom yaml items (e.g.
pinning +#sandpaper and varnish versions) should live +url: 'https://swcarpentry.github.io/r-novice-gapminder' +analytics: carpentries +lang: ja diff --git a/locale/ja/episodes/01-rstudio-intro.Rmd b/locale/ja/episodes/01-rstudio-intro.Rmd new file mode 100644 index 000000000..5099e49f0 --- /dev/null +++ b/locale/ja/episodes/01-rstudio-intro.Rmd @@ -0,0 +1,634 @@ +--- +title: RとRStudio入門 +teaching: 45 +exercises: 10 +source: Rmd +--- + +::::::::::::::::::::::::::::::::::::::: objectives + +- RStudio IDE の各ウィンドウの使用目的と使い方が説明出来るようになりましょう。 +- RStudio IDE のボタンやオプションの位置を理解しましょう。 +- 変数が定義出来るようになりましょう。 +- 変数に値の設定が出来るようになりましょう。 +- R セッションのワークスペース管理が出来るようになりましょう。 +- 算術演算子や比較演算子が使えるようになりましょう。 +- 関数が呼び出せるようになりましょう。 +- パーッケージをロードしましょう。 + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- RStudio はどのように操作したらよいですか? +- R とはどのようにやりとりしたらよいですか? +- 環境の管理はどうしたらよいですか? +- パッケージのインストールはどうしたらよいですか? + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +```{r, include=FALSE} +``` + +## ワークショップを始める前に + +RとRStudioの最新バージョンが、自分のコンピューターにインストールされているか確認してください。 最新バージョンであることが重要である理由は、ワークショップで使うパッケージには、Rが最新でないと、正常に(または全く)インストールされないものがあるからです。 + +- [ここからRの最新バージョンをダウンロード及びインストール下さい](https://www.r-project.org/) +- [ここからRStudioをダウンロード及びインストール下さい](https://www.rstudio.com/products/rstudio/download/#download) + +## R とRStudio を使う理由 + +Software CarpentryのR部分のワークショップへようこそ。 + +科学は多段階のプロセスです。 +実験を設計してデータを収集した後、本当の楽しみは分析から始まります!このレッスンでは、R言語の基本を教えるとともに、科学プロジェクトのためのコードを整理するベストプラクティスを学び、作業をより簡単にする方法を紹介します。 生データから始め、探索的な分析を行い、結果をグラフでプロットする方法を学びます。この例では、[gapminder.org](https://www.gapminder.org)のデータセットを使用し、時間を通じた各国の人口情報を扱います。データをRに読み込むことができますか?セネガルの人口をプロットできますか?アジア大陸の国々の平均所得を計算できますか?このレッスンの終わりまでに、これらの国々の人口を1分以内にプロットできるようになります! 
+ +データを分析するためにMicrosoft ExcelやGoogleスプレッドシートを使用することもできますが、これらのツールは柔軟性やアクセス性に限界があります。さらに、元データの変更や探索の手順を共有することが難しいため、これは「再現可能な」研究にとって重要なポイントです([再現可能な研究についてはこちら](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003285))。 特に、生データの探索や変更を行うステップを共有するのが難しいため、[「再現可能」](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003285)な研究には不向きです。 + +したがって、このレッスンでは、RとRStudioを使用してデータの探索を始める方法を学びます。RプログラムはWindows、Mac、Linuxオペレーティングシステムで利用可能で、上記のリンクから無料でダウンロードできます。Rを実行するために必要なのはRプログラムだけです。 R プログラムはWindows、Mac、Linux のオペレーティングシステムで利用でき、無料で上記からダウンロード可能です。 R を実行するにはR プログラムだけで十分です。 + +しかし、Rをより使いやすくするために、同じくダウンロードしたRStudioというプログラムを使用します。RStudioは無料でオープンソースの統合開発環境(IDE)で、組み込みエディタを提供し、すべてのプラットフォームで動作します(サーバー上でも利用可能)。バージョン管理やプロジェクト管理との統合など、多くの利点があります。 RStudioは、無料であり、Rを組み込んだオープンソースの 総合開発環境(IDE: Integrated Development Environment)です。 RStudioは、全てのプラットフォーム(サーバーも含む)で起動できること、 エディタが組み込まれていることや、プロジェクト管理やバージョン管理にも対応しているなど、 良いところがいっぱいがあります。 + +## 概要 + +生のデータから、予備解析をし、結果をどう グラフ上にプロットするかを学びます。 ここでの例は [gapminder.org](https://www.gapminder.org) のデータセットを使います。このデータセットには、多くの国の人口の時系列データが入っています。 データをRに読み込むことができますか? セネガルの人口をプロットできますか? アジア大陸にある国の平均所得を計算できますか? +これらのレッスンの終わるまでに、これらの全ての国の人口をプロットしたりするようなことが 一分足らずでできるようになるでしょう! + +**基本レイアウト** + +RStudioを初めて開くと、次の3つのパネルが表示されます: + +- インタラクティブなRコンソール/ターミナル(左全体) +- 環境/履歴/接続(右上のタブ) +- ファイル/プロット/パッケージ/ヘルプ/ビューア(右下のタブ) + +![](fig/01-rstudio.png){alt='RStudioのレイアウト'} + +ファイル(例えばRスクリプト)を開くと、上部左にエディタパネルも表示されます。 + +![](fig/01-rstudio-script.png){alt='RStudioで.Rファイルを開いたレイアウト'} + +::::::::::::::::::::::::::::::::::::::::: callout + +## Rスクリプト + +Rコンソールに書いたコマンドはファイルに保存し、再実行することができます。このようなRコードを含むファイルをRスクリプトと呼びます。Rスクリプトは名前の末尾が`.R`となっており、それがRスクリプトであることを示します。 このように実行される R コードを含むファイルは R スクリプトと呼ばれます。 R スクリプトには `.R` が名前の末尾にが付けられています。 + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## RStudio内でのワークフロー + +RStudio内で作業する主な方法は2つあります: + +1. インタラクティブなRコンソール内でテストや試行を行い、そのコードをコピーして.Rファイルに貼り付け、後で実行する。 + +- 小規模なテストや初期の段階では効果的です。 +- すぐ面倒になる。 + +2. 
最初から.Rファイルに記述し、RStudioのショートカットキーを使用して「Run」コマンドを実行し、現在の行、選択した行、または変更した行をインタラクティブなRコンソールに送る。 + +- これは作業を始める良い方法です。すべてのコードが後で使用するために保存されます。 +- RStudio内またはRの`source()`関数を使用して作成したファイルを実行できます。 + +::::::::::::::::::::::::::::::::::::::::: callout + +## ヒント:コードのセグメントを実行する + +RStudioでは、エディタウィンドウからコードを実行する柔軟性があります。ボタン、メニューオプション、およびキーボードショートカットがあります。現在の行を実行するには、次の方法があります: ボタン、メニュー選択、そしてキーボードのショートカットがあります。 現在の行を走らせるには、 + +1. エディタパネルの上部にある「Run」ボタンをクリックする +2. 「Code」メニューから「Run Lines」を選択する +3. WindowsかLinuxなら、Ctrl\+Return 、またはOS Xなら、\+Return を押す(このショートカットはボタンの上にマウスを合わせると表示される)。 ードのかたまりを走らせるには、まずその部分を選択してから`Run`を押します。 + WindowsまたはLinuxではCtrl\+Return、OS Xでは\+Returnを押す + (このショートカットはボタンの上にマウスをホバーさせると確認できます)。コードブロックを実行するには、選択して「Run」をクリックします。 + 最近実行したコードブロックを修正した場合、セクションを再選択して「Run」を押す必要はありません。次のボタン「Re-run the previous region」を使用すると、修正を含む前のコードブロックを実行できます。 これは、前のコードのかたまりを 修正を行った部分を含めて走らせます。 + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## R入門 + +Rを使う時間のほとんどは、Rのインタラクティブコンソールでの作業となるでしょう。 ここが、全てのコードを走らせる、また、.Rファイルにコードを加える前にアイディアを 試してみるのに使える環境となります。 Rでの作業の多くは、インタラクティブなRコンソール内で行います。ここでは、すべてのコードを実行し、Rスクリプトファイルに追加する前にアイデアを試すのに便利な環境です。RStudioのコンソールは、コマンドライン環境で`R`と入力した場合と同じです。 + +Rインタラクティブセッションでまず目に入ってくるのは、ひとかたまりの情報と その後に続く、「 」と点滅するカーソルです。これは多くの観点で、 シェルのレッスンで学んだシェル環境と似ています。 Rのインタラクティブセッションを開くと、最初に情報が表示され、その後に「>」と点滅するカーソルが現れます。これは、シェルレッスンで学んだシェル環境と多くの点で似ています。「Read, evaluate, print loop」(読み取り、評価、印刷ループ)の考え方に基づいて動作します:コマンドを入力すると、Rがそれを実行し、結果を返します。 + +## Rを計算機として使う + +Rで最も簡単なことは、算術を行うことです: + +```{r} +1 + 100 +``` + +And R will print out the answer, with a preceding "[1]". [1] is the index of +the first element of the line being printed in the console. 
Rは答えを表示し、その前に"[1]"を付けます。[1]はコンソールに表示される行の最初の要素のインデックスを示します。ベクトルのインデックスについての詳細は、[エピソード6:データのサブセット化](https://swcarpentry.github.io/r-novice-gapminder/06-data-subsetting/index.html)を参照してください。 + +不完全なコマンドを入力すると、R は完了を待機します。Unix Shellのbashに慣れている場合、この動作をbashで見たことがあるかもしれません。 + +```r +> 1 + +``` + +```output ++ +``` + +Any time you hit return and the R session shows a "+" instead of a ">", it +means it's waiting for you to complete the command. 「>」ではなく「+」が表示された場合、Rはコマンドの完了を待機しています。コマンドをキャンセルしたい場合はEscを押すと、RStudioは「>」プロンプトに戻ります。 + +::::::::::::::::::::::::::::::::::::::::: callout + +## ヒント:コマンドのキャンセル + +RStudioではなく、コマンドラインからRを使う場合、 コマンドを取り消す場合、Escの代わりにCtrl+Cを使う必要があります。これは、Macユーザーも同じです! コマンドの取り消しは、不完全なコマンドを消す他にも使えます。 + +コマンドのキャンセルは、不完全なコマンドを終了させるだけでなく、予想以上に時間がかかる場合にコードの実行を停止したり、現在書いているコードを削除したりするためにも役立ちます。 + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +Rを計算機として使用する場合、演算の順序は学校で学んだものと同じです。 + +優先順位が高いものから低いものへ: + +- 括弧:`(`, `)` +- 累乗:`^`または`**` +- 乗算:`*` +- 除算:`/` +- 加算:`+` +- 減算:`-` + +```{r} +3 + 5 * 2 +``` + +評価の順序を変更したい場合や意図を明確にしたい場合は、括弧を使用してグループ化します。 + +```{r} +(3 + 5) * 2 +``` + +括弧をつける必要ではないときは、面倒かもしれませんが、そうすることで自分の意図 するところがはっきり伝わります。 +必要ない場合は煩雑になりますが、意図を明確にできます。他の人が後でコードを読むかもしれないことを忘れないでください。 + +```{r, eval=FALSE} +(3 + (5 * (2 ^ 2))) # 読みにくい +3 + 5 * 2 ^ 2 # 規則を覚えていれば明快 +3 + 5 * (2 ^ 2) # 規則を忘れた場合はこれが助けになる +``` + +各コード行の後にあるテキストは「コメント」と呼ばれます。ハッシュ記号`#`の後に続く内容は、コードを実行する際にRによって無視されます。 シャープ(ナンバー)記号`#` の後に来るものは、Rがコードを実行する際は無視されます。 + +非常に小さいまたは大きい数値は、科学表記法で表示されます: + +```{r} +2/10000 +``` + +しかし、すぐに手間がかかるようになります。 これは「`10^XX`で掛ける」という短縮形です。したがって、`2e-4`は`2 * 10^(-4)`の短縮形です。 + +科学表記法で数値を書くこともできます: + +```{r} +5e3 # マイナスがない点に注意 +``` + +## 数学関数 + +Rには多くの組み込み数学関数があります。関数を呼び出すには、関数名を入力し、その後に開き括弧と閉じ括弧を続けます。関数は引数を入力として受け取ります。関数の括弧内に入力したものはすべて引数と見なされます。関数によって引数の数は異なり、引数を必要としないものから複数の引数を必要とするものまであります。例: To call a function, +we can type its name, followed by open and closing parentheses. 
+Functions take arguments as inputs, anything we type inside the parentheses of a function is considered an argument. Depending on the function, the number of arguments can vary from none to multiple. 例えば: + +```{r, eval=FALSE} +getwd() # 絶対パスを返す +``` + +この例では引数は不要ですが、以下の数学関数では結果を計算するために値を渡す必要があります。 + +```{r} +sin(1) # 三角関数 +``` + +```{r} +log(1) # 自然対数 +``` + +```{r} +log10(10) # 常用対数(底10) +``` + +```{r} +exp(0.5) # e^(1/2) +``` + +Rのすべての関数を覚えようとする必要はありません。Googleで検索するか、関数名の最初の数文字を覚えていれば、RStudioのタブ補完機能を使うことができます。 + +RStudioの大きな利点の一つは、オートコンプリート機能があることです。これにより、関数、引数、および受け取る値を簡単に調べることができます。 + +コマンド名の前に、`?` を付けることで、そのコマンドのヘルプのページを開くことができます。 コマンド名の前に`?`を付けると、そのコマンドのヘルプページが開きます。RStudioを使用している場合、'Help'ペインに表示されます。ターミナルでRを使用している場合は、ブラウザでヘルプページが開きます。ヘルプページにはコマンドの詳細な説明と動作の仕組みが含まれています。ページの下部までスクロールすると、通常、コマンドの使用例が掲載されています。後ほど例を見ていきます。 +The help page will include a detailed description of the command and +how it works. Scrolling to the bottom of the help page will usually +show a collection of code examples which illustrate command usage. +これについては、後ほど、例で見てみることにしましょう。 + +## 比較演算 + +Rでは比較を行うこともできます: + +```{r} +1 == 1 # 等しい(等号が2つ、"等しい"と読む) +``` + +```{r} +1 != 2 # 等しくない("等しくない"と読む) +``` + +```{r} +1 < 2 # より小さい +``` + +```{r} +1 <= 1 # 以下 +``` + +```{r} +1 > 0 # より大きい +``` + +```{r} +1 >= -9 # 以上 +``` + +::::::::::::::::::::::::::::::::::::::::: callout + +## ヒント:数値の比較 + +数値を比較する際の注意点として、整数(小数を含まない数値型)以外を比較する場合は、`==`を使用しないでください。 + +コンピュータは小数を特定の精度でしか表現できないため、Rが表示する際に同じに見える2つの数値が、内部表現では異なる場合があります。このわずかな差異は「数値計算誤差(Machine numeric tolerance)」と呼ばれます。 + +代わりに`all.equal`関数を使用してください。 + +さらに詳しく知りたい方はこちら:[http://floating-point-gui.de/](https://floating-point-gui.de/) + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## 変数と代入 + +代入演算子`<-`を使用して、値を変数に格納できます: + +```{r} +x <- 1/40 +``` + +Notice that assignment does not print a value. Instead, we stored it for later +in something called a **variable**. 
以前は`x`に0.025が格納されていましたが、現在は100が格納されています。 + +```{r} +x +``` + +正確には、この格納された値は[浮動小数点数](https://en.wikipedia.org/wiki/Floating_point)と呼ばれる分数の_10進数近似値_です。 + +RStudioの右上ペインにある`Environment`タブを確認すると、`x`とその値が表示されていることがわかります。変数`x`は、数値を期待する計算の中で数値の代わりに使用できます: Our variable `x` can be used in place of a number in any calculation that expects a number: + +```{r} +log(x) +``` + +また、変数には再代入も可能です: + +```{r} +x <- 100 +``` + +`x`は、0.025という値でしたが、今は、100になりました。 + +代入値には、代入先の変数を含めることもできます: + +```{r} +x <- x + 1 # RStudioの右上タブでxの説明が更新されることに注目 +y <- x * 2 +``` + +代入の右辺には有効なR式を使用できます。右辺は代入が行われる前に_完全に評価_されます。 +右側については、代入される前に、 計算が完全に実施 されます。 + +変数名には、文字、数字、アンダースコア、ピリオドを含めることができますが、スペースは含められません。また、変数名は文字またはピリオドで始める必要があります(数字やアンダースコアでは始めることはできません)。ピリオドで始まる変数は隠し変数と見なされます。 They +must start with a letter or a period followed by a letter (they cannot start with a number nor an underscore). +Variables beginning with a period are hidden variables. +長い変数名については、異なる人が異なる規約を使用します。その例として: + +- ピリオドを.単語の.間に入れる +- underscores_between_words +- 単語の始まりを大文字にする(camelCaseToSeparateWords) + +どれを使用するかは自由ですが、**一貫性を保つ**ことが重要です。 + +代入には`=`演算子を使用することも可能です: + +```{r} +x = 1/40 +``` + +But this is much less common among R users. The most important thing is to +**be consistent** with the operator you use. しかし、これはRユーザーの間ではあまり一般的ではありません。最も重要なのは、使用する演算子に**一貫性を持つ**ことです。`<-`を使用したほうが混乱が少ない場合もあり、コミュニティでは最も一般的に使われています。そのため、`<-`を使用することを推奨します。 So the recommendation is to use `<-`. + +::::::::::::::::::::::::::::::::::::::: challenge + +## チャレンジ 1 + +次の中で有効なRの変数名はどれですか? 
+ +```{r, eval=FALSE} +min_height +max.height +_age +.mass +MaxLength +min-length +2widths +celsius2kelvin +``` + +::::::::::::::: solution + +## チャレンジ1の解答 + +次のものはR変数として使用できます: + +```{r ch1pt1-sol, eval=FALSE} +min_height +max.height +MaxLength +celsius2kelvin +``` + +次のものは隠し変数を作成します: + +```{r ch1pt2-sol, eval=FALSE} +.mass +``` + +次のものは変数を作成できません: + +```{r ch1pt3-sol, eval=FALSE} +_age +min-length +2widths +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## ベクトル化 + +One final thing to be aware of is that R is _vectorized_, meaning that +variables and functions can have vectors as values. Rの特徴の1つに、Rが**ベクトル化**されているという点があります。つまり、変数や関数にベクトルを値として持たせることができます。物理学や数学におけるベクトルとは異なり、Rにおけるベクトルは同じデータ型の値が順序付けられた集合を指します。例: 例えば: + +```{r} +1:5 +2^(1:5) +x <- 1:5 +2^x +``` + +この機能は非常に強力で、今後のレッスンでさらに詳しく説明します。 + +## 環境の管理 + +Rセッションとやり取りするための便利なコマンドがいくつかあります。 + +`ls`を使用すると、グローバル環境(現在のRセッション)に保存されているすべての変数と関数を一覧表示できます: + +```{r, eval=FALSE} +ls() +``` + +```{r, echo=FALSE} +# このRmdドキュメントをレンダリングする際に`ls()`を実行すると、 +# 教材で説明されている内容と異なる項目(例えば "args", "dest_md" など)が +# 出力されることがあります。 +# 学習者が自分のセッションで`ls()`を実行した際に観察される結果を再現するために、 +# 一時環境を使用して解決しています。 + +temp.env <- new.env() +temp.env$x <- x +temp.env$y <- y +ls(temp.env) +rm(temp.env) +``` + +::::::::::::::::::::::::::::::::::::::::: callout + +## ヒント:隠しオブジェクト + +シェルのように、`ls`はデフォルトでは、"."で始まる変数と関数を表示しません。 シェルと同様に、`ls`ではデフォルトで"."で始まる変数や関数は表示されません。すべてのオブジェクトを一覧表示するには、`ls(all.names=TRUE)`と入力してください。 + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +ここでは、`ls`に引数を渡していませんが、関数を呼び出すために括弧は必要です。 + +`ls`だけを入力すると、オブジェクト一覧ではなくコードが表示されます: + +```{r} +ls +``` + +これはどういうことでしょうか? + +Rではすべてがオブジェクトであり、オブジェクト名だけを入力すると、その内容が表示されます。先ほど作成したオブジェクト`x`には`r x`が格納されています: The object `x` that we +created earlier contains `r x`: + +```{r} +x +``` + +オブジェクト`ls`には、`ls`関数を動作させるRコードが格納されています!関数の仕組みや作成方法については後のレッスンで説明します。 We'll talk +more about how functions work and start writing our own later. 
+ +不要になったオブジェクトを削除するには、`rm`を使用します: + +```{r, eval=FALSE} +rm(x) +``` + +多くのオブジェクトが環境にあり、それらをすべて削除したい場合は、`ls`の結果を`rm`関数に渡します: + +```{r, eval=FALSE} +rm(list = ls()) +``` + +In this case we've combined the two. この場合、2つの関数を組み合わせています。演算の順序と同様に、最も内側の括弧内の内容が最初に評価されます。 + +この場合、`ls`の結果が`rm`の`list`引数として使用されるよう指定しています。引数に値を名前で割り当てる場合、**必ず`=`演算子を使用**する必要があります! 代入は値を表示しません。代わりに、それを後で使用するために**変数**というものに格納します。この場合、`x`には値`0.025`が格納されています: + +代わりに`<-`を使用すると、予期しない副作用が発生するか、エラーメッセージが表示される可能性があります: + +```{r, error=TRUE} +rm(list <- ls()) +``` + +::::::::::::::::::::::::::::::::::::::::: callout + +## ヒント:警告とエラー + +Pay attention when R does something unexpected! Rが予期しない動作をした場合は注意してください!エラーはRが計算を続行できない場合に発生します。一方、警告は通常、関数が実行されたものの、期待通りに動作しなかったことを意味します。 Warnings on the +other hand usually mean that the function has run, but it probably +hasn't worked as expected. + +どちらの場合も、Rが表示するメッセージには問題を解決するための手がかりが含まれていることが多いです。 + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Rパッケージ + +Rにはパッケージを作成することで関数を追加することができます。また、他の人が作成したパッケージを利用することも可能です。この執筆時点で、CRAN(Comprehensive R Archive Network)には10,000を超えるパッケージが利用可能です。RとRStudioにはパッケージを管理するための機能があります: As of this writing, there +are over 10,000 packages available on CRAN (the comprehensive R archive +network). をインストールする別の方法は次のとおりです: + +- インストールされているパッケージを確認するには、`installed.packages()`を入力します。 +- パッケージをインストールするには、`install.packages("packagename")`と入力します。ここで`packagename`はパッケージ名で、引用符で囲みます。 +- インストール済みのパッケージを更新するには、`update.packages()`を入力します。 +- パッケージを削除するには、`remove.packages("packagename")`を使用します。 +- パッケージを利用可能にするには、`library(packagename)`を入力します。 + +RStudioの右下ペインの「Packages」タブからもパッケージを表示、ロード、デタッチすることができます。このタブをクリックすると、インストール済みのパッケージがチェックボックス付きで表示されます。パッケージ名の横にあるチェックボックスがオンの場合、そのパッケージはロードされており、オフの場合はロードされていません。空のボックスをクリックするとそのパッケージがロードされ、チェックボックスをクリックするとパッケージがデタッチされます。 Clicking on this tab will display all of the installed packages with a checkbox next to them. 
If the box next to a package name is checked, the package is loaded and if it is empty, the package is not loaded. Click an empty box to load that package and click a checked box to detach that package. + +また、「Packages」タブの上部にある「Install」ボタンと「Update」ボタンを使用して、パッケージをインストールおよび更新できます。 + +::::::::::::::::::::::::::::::::::::::: challenge + +## チャレンジ 2 + +次のプログラムの各文の後で、各変数の値はどうなるでしょうか? + +```{r, eval=FALSE} +mass <- 47.5 +age <- 122 +mass <- mass * 2.3 +age <- age - 20 +``` + +::::::::::::::: solution + +## チャレンジ2の解答 + +```{r ch2pt1-sol} +mass <- 47.5 +``` + +この時点で変数`mass`の値は`r mass`になります。 + +```{r ch2pt2-sol} +age <- 122 +``` + +この時点で変数`age`の値は`r age`になります。 + +```{r ch2pt3-sol} +mass <- mass * 2.3 +``` + +既存の値`r mass/2.3`に2.3を掛け、新しい値`r mass`を`mass`に格納します。 + +```{r ch2pt4-sol} +age <- age - 20 +``` + +既存の値`r age + 20`から20を引き、新しい値`r age`を`age`に格納します。 + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## チャレンジ 3 + +前のチャレンジのコードを実行し、massとageを比較するコマンドを書きなさい。massはageより大きいですか? massはageよりも大きいでしょうか? 
+ +::::::::::::::: solution + +## チャレンジ3の解答 + +この質問に答える方法の1つとして、次のように`>`を使用できます: + +```{r ch3-sol} +mass > age +``` + +このコードは、`r mass`が`r age`より大きいため、論理値`TRUE`を返すはずです。 + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## チャレンジ 4 + +作業環境を整理し、massとageの変数を削除しなさい。 + +::::::::::::::: solution + +## チャレンジ4の解答 + +このタスクを達成するには、`rm`コマンドを使用します: + +```{r ch4-sol} +rm(age, mass) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## チャレンジ 5 + +以下のパッケージをインストールしなさい:`ggplot2`, `plyr`, `gapminder` + +::::::::::::::: solution + +## チャレンジ5の解答 + +必要なパッケージをインストールするには、`install.packages()`コマンドを使用します。 + +```{r ch5-sol, eval=FALSE} +install.packages("ggplot2") +install.packages("plyr") +install.packages("gapminder") +``` + +1つの`install.packages()`コマンドで複数のパッケージをインストールすることもできます: + +```{r ch5-sol2, eval=FALSE} +install.packages(c("ggplot2", "plyr", "gapminder")) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::: instructor + +`ggplot2`をインストールする際、一部のユーザーは依存関係フラグを使用する必要がある場合があります。この提案は既知のバグに関する議論に基づくものではなく、ワークショップの実施中に散発的に発生したエラーを解決してきた講師の経験に基づく推奨事項です: + +```{r ch5-sol3, eval=FALSE} +install.packages("ggplot2", dependencies = TRUE) +``` + +::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: keypoints + +- RStudioを使用してRプログラムを作成および実行します。 +- Rには通常の算術演算子と数学関数があります。 +- `<-`を使用して変数に値を代入します。 +- `ls()`を使用してプログラム内の変数を一覧表示します。 +- `rm()`を使用してプログラム内のオブジェクトを削除します。 +- `install.packages()`を使用してパッケージ(ライブラリ)をインストールします。 + +:::::::::::::::::::::::::::::::::::::::::::::::::: diff --git a/locale/ja/episodes/02-project-intro.Rmd
b/locale/ja/episodes/02-project-intro.Rmd new file mode 100644 index 000000000..a382cf5db --- /dev/null +++ b/locale/ja/episodes/02-project-intro.Rmd @@ -0,0 +1,232 @@ +--- +title: RStudio を使ったプロジェクト管理 +teaching: 20 +exercises: 10 +source: Rmd +--- + +::::::::::::::::::::::::::::::::::::::: objectives + +- RStudio で自己完結型のプロジェクトを作成する + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- R でプロジェクトをどのように管理できますか? + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +```{r, include=FALSE} +``` + +## はじめに + +科学的なプロセスは本質的に段階的なものであり、多くのプロジェクトはランダムなメモ、一部のコード、次に原稿と進行し、最終的にはすべてが混ざり合ってしまうことがよくあります。 + + + + +ほとんどの人はプロジェクトを次のように整理しがちです: + +![](fig/bad_layout.png){alt='悪いプロジェクト構成を示すファイルマネージャーのスクリーンショット'} + +このような方法を_絶対に_避けるべき理由は数多くあります: + +1. データのどのバージョンがオリジナルで、どれが修正済みなのかを区別するのが非常に難しい。 +2. 様々な拡張子のファイルが混在して、非常に散らかる。 +3. 必要なものを見つけたり、正確なコードで生成した正しい図表を関連付けたりするのに非常に時間がかかる。 + +良いプロジェクト構成は、最終的に生活をより簡単にします: + +- データの整合性を確保しやすくなる。 +- 他の人(研究室の同僚、共同研究者、指導教員)とコードを共有するのが簡単になる。 +- 原稿の投稿時にコードを簡単にアップロードできる。 +- しばらく休んだ後にプロジェクトを再開しやすくなる。 + +## 考えられる解決策 + +幸いなことに、作業を効果的に管理するためのツールやパッケージが存在します。 + +RStudio の最も強力で便利な機能の一つがプロジェクト管理機能です。本日はこれを使って自己完結型の再現可能なプロジェクトを作成します。 今日では、必要なものが揃い、再現可能なプロジェクトを作成するために、これが使われているのでしょう。 + +::::::::::::::::::::::::::::::::::::::: challenge + +## チャレンジ 1: 自己完結型プロジェクトの作成 + +RStudio で新しいプロジェクトを作成します: + +1. 「File」メニューをクリックし、「New Project」を選択します。 +2. 「New Directory」をクリックします。 +3. 「New Project」をクリックします。 +4. プロジェクトを保存するディレクトリの名前(例:`my_project`)を入力します。 +5. 「Create a git repository」のチェックボックスが表示される場合は選択します。 +6. 
「Create Project」ボタンをクリックします。 + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +作成した RStudio プロジェクトを開く最も簡単な方法は、ファイルシステムをたどって保存したディレクトリに移動し、`.Rproj` ファイルをダブルクリックすることです。これにより RStudio が開き、R セッションが `.Rproj` ファイルと同じディレクトリで開始します。データ、プロット、スクリプトはすべてプロジェクトディレクトリに関連付けられます。さらに、RStudio プロジェクトは複数のプロジェクトを同時に開くことが可能で、それぞれのプロジェクトディレクトリに分離されます。これにより、複数のプロジェクトを開いても相互に干渉しません。 RStudio ウィンドウ上部のメニューで「Session」をクリックし、「Set Working Directory」を選択して「Choose Directory」をクリックします。その後、開いたウィンドウでプロジェクトディレクトリに戻り、「Open」をクリックします。コンソールに `setwd` コマンドが自動的に表示されます。 All your data, plots and scripts will now be +relative to the project directory. 出力を管理する方法はたくさんあります。各分析ごとに異なるサブディレクトリを持つ出力フォルダを用意すると、後で便利です。多くの分析は探索的で最終プロジェクトに使用されないことが多く、一部の分析はプロジェクト間で共有されることもあります。 This allows you to keep multiple projects open without them +interfering with each other. + +::::::::::::::::::::::::::::::::::::::: challenge + +## チャレンジ 2: ファイルシステムを使った RStudio プロジェクトの開き方 + +1. RStudio を終了します。 +2. チャレンジ 1 で作成したプロジェクトのディレクトリに移動します。 +3. そのディレクトリ内の `.Rproj` ファイルをダブルクリックします。 + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## プロジェクト管理のベストプラクティス + +プロジェクトを整理するための「ベスト」な方法はありませんが、管理を容易にするために従うべきいくつかの一般原則があります: + +### データを読み取り専用として扱う + +プロジェクトを始めるにあたって、これが多分一番重要なゴールかもしれません。 データ収集には、多くの時間と費用のいずれか、または両方が掛かることが多いものです。 データの修正も行える形で読込みも書込みもできる作業(例えば、エクセル)をすると、 データがどこからきたか、または収集されてからどう修正されてきたかが分からなくなります。 +ですから、データは「読み込むだけ」のものと扱うのがよいというわけです。 + +### データのクリーニング + +多くの場合、データは「汚れて」います: R(または、他のプログラミング言語)が使える形にするためには、かなりの前処理が必要となるでしょう。 +この作業は、「データ・マンジング」と呼ばれることもあります。 Storing these scripts in a +separate folder, and creating a second "read-only" data folder to hold the +"cleaned" data sets can prevent confusion between the two sets. + +### 生成された出力を使い捨てとみなす + +スクリプトによって生成されたものはすべて使い捨てとみなすべきです:スクリプトからすべてを再生成できる必要があります。 + +There are lots of different ways to manage this output. Having an output folder +with different sub-directories for each separate analysis makes it easier later. 
+Many analyses are exploratory and don't end up being used in the final
+project, and some analyses get shared between projects.
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## Tip: Good Enough Practices for Scientific Computing
+
+[Good Enough Practices for Scientific Computing](https://github.com/swcarpentry/good-enough-practices-in-scientific-computing/blob/gh-pages/good-enough-practices-for-scientific-computing.pdf) gives the following recommendations for project organization:
+
+1. Put each project in its own directory, which is named after the project.
+2. Put text documents associated with the project in the `doc` directory.
+3. Put raw data and metadata in the `data` directory, and files generated during cleanup and analysis in a `results` directory.
+4. Put source for the project's scripts and programs in the `src` directory, and programs brought in from elsewhere or compiled locally in the `bin` directory.
+5. Name all files to reflect their content or function.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+### Separate function definition and application
+
+One of the more effective ways to work with R is to start by writing the code you want to run directly in a .R script, and then running the selected lines (either using the keyboard shortcuts in RStudio or clicking the "Run" button) in the interactive R console.
+
+When your project is in its early stages, the initial .R script file usually contains many lines of directly executed code. As it matures, reusable chunks get pulled into their own functions. It's a good idea to separate these into two folders: one to store useful functions that you'll reuse across analyses and projects, and one to store the analysis scripts.
+
+### Save the data in the data directory
+
+Now that we have a good directory structure, we will place/save the data file in the `data/` directory.
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Challenge 3
+
+Download the gapminder data from [this link to a csv file](data/gapminder_data.csv).
+
+1. Download the file (right mouse click on the link above -> "Save link as" / "Save file as", or click on the link and after the page loads, press Ctrl+S or choose File -> "Save page as").
+2. Make sure it's saved under the name `gapminder_data.csv`.
+3. Save the file in the `data/` folder within your project.
+
+We will load and inspect these data later.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Challenge 4
+
+It is useful to get some general idea about the dataset, directly from the command line, before loading it into R. Understanding the dataset better will come in handy when making decisions on how to load it in R. Use the command-line shell to answer the following questions:
+
+1. What is the size of the file?
+2. How many rows of data does it contain?
+3. What kinds of values are stored in this file?
+
+::::::::::::::: solution
+
+## Solution to Challenge 4
+
+By running these commands in the shell:
+
+```{r ch2a-sol, engine="sh"}
+ls -lh data/gapminder_data.csv
+```
+
+The file size is 80K.
+
+```{r ch2b-sol, engine="sh"}
+wc -l data/gapminder_data.csv
+```
+
+There are 1705 lines. The data looks like:
+
+```{r ch2c-sol, engine="sh"}
+head data/gapminder_data.csv
+```
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## Tip: command line in RStudio
+
+The Terminal tab in the console pane provides a convenient place directly within RStudio to interact with the command line.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+### Working directory
+
+Knowing R's current working directory is important because when you need to access other files (for example, to import a data file), R will look for them relative to the current working directory.
+
+Each time you create a new RStudio Project, it will create a new directory for that project. When you open an existing `.Rproj` file, it will open that project and set R's working directory to the folder that file is in.
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Challenge 5
+
+You can check the current working directory with the `getwd()` command, or by using the menus in RStudio.
+
+1. In the console, type `getwd()` ("wd" is short for "working directory") and hit Enter.
+2. In the Files pane, double click on the `data` folder to open it (or navigate to any other folder you wish). To get the Files pane back to the current working directory, click "More" and then select "Go To Working Directory".
+
+You can change the working directory with `setwd()`, or by using RStudio menus.
+
+1. In the console, type `setwd("data")` and hit Enter. Type `getwd()` and hit Enter to see the new working directory.
+2. In the menus at the top of the RStudio window, click the "Session" menu button, and then select "Set Working Directory" and then "Choose Directory". Next, in the windows navigator that opens, navigate back to the project directory, and click "Open". Note that a `setwd` command will automatically appear in the console.
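
The console steps above can be sketched as a short R snippet. This is only an illustration: it uses a temporary directory as a stand-in for a project (the names `project_dir`, `old_wd`, `in_data` and `restored` are made up for this sketch), so it runs anywhere and puts the working directory back afterwards.

```r
# Remember where we started so we can return afterwards
old_wd <- getwd()

# Hypothetical stand-in for a project: a temporary directory with a data/ folder
project_dir <- tempfile("my_project")
dir.create(file.path(project_dir, "data"), recursive = TRUE)

setwd(project_dir)   # roughly what opening the .Rproj file does for you
setwd("data")        # relative paths are resolved from the working directory
in_data <- basename(getwd())

setwd(old_wd)        # go back to where we started
restored <- identical(getwd(), old_wd)
```

Saving the old directory before calling `setwd()` is a small habit that makes it easy to undo the change.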
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## Tip: File does not exist errors
+
+When you're attempting to reference a file in your R code and you're getting errors saying the file doesn't exist, it's a good idea to check your working directory.
+You need to either provide an absolute path to the file, or make sure the file is saved in the working directory (or a subfolder of the working directory) and provide a relative path.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+### Version Control
+
+It is important to use version control with projects. Go [here for a good lesson which describes using Git with RStudio](https://swcarpentry.github.io/git-novice/14-supplemental-rstudio.html).
+
+:::::::::::::::::::::::::::::::::::::::: keypoints
+
+- Use RStudio to create and manage projects with consistent layout.
+- Treat raw data as read-only.
+- Treat generated output as disposable.
+- Separate function definition and application.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
diff --git a/locale/ja/episodes/03-seeking-help.Rmd b/locale/ja/episodes/03-seeking-help.Rmd
new file mode 100644
index 000000000..0af6ecf08
--- /dev/null
+++ b/locale/ja/episodes/03-seeking-help.Rmd
@@ -0,0 +1,257 @@
+---
+title: Seeking Help
+teaching: 10
+exercises: 10
+source: Rmd
+---
+
+::::::::::::::::::::::::::::::::::::::: objectives
+
+- To be able to read R help files for functions and special operators.
+- To be able to use CRAN task views to identify packages to solve a problem.
+- To be able to seek help from your peers.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+:::::::::::::::::::::::::::::::::::::::: questions
+
+- How can I get help in R?
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+```{r, include=FALSE}
+```
+
+## Reading Help Files
+
+One of the most daunting aspects of R is the large number of functions available.
+It would be prohibitive, if not impossible, to remember the correct usage for every function you use.
+Luckily, using the help files means you don't have to remember that!
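+
+R, and every package, provide help files for functions. The general syntax to search for help on a function from a package loaded into your session is:
+
+```{r, eval=FALSE}
+?function_name
+help(function_name)
+```
+
+For example, take a look at the help file for `write.table()`; we will be using a similar function in an upcoming episode.
+
+```{r, eval=FALSE}
+?write.table()
+```
+
+This will load up a help page in RStudio (or as plain text in R itself).
+
+Each help page is broken down into sections:
+
+- **Description**: An extended description of what the function does.
+- **Usage**: The arguments of the function and their default values (which can be changed).
+- **Arguments**: An explanation of the data each argument is expecting.
+- **Details**: Any important details to be aware of.
+- **Value**: The data the function returns.
+- **See Also**: Any related functions you might find useful.
+- **Examples**: Some examples for how to use the function.
+
+Different functions might have different sections, but these are the main ones you should be aware of.
+
+Notice how related functions might call for the same help file:
+
+```{r, eval=FALSE}
+?write.table()
+?write.csv()
+```
+
+This is because these functions have very similar applicability and often share the same arguments, so package authors often choose to document them together in a single help file.
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## Tip: Running Examples
+
+From within the function help page, you can highlight code in the Examples and hit Ctrl\+Return to run it in the RStudio console. This gives you a quick way to get a feel for how a function works.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## Tip: Reading Help Files
+
+One of the most daunting aspects of R is the large number of functions available. It would be prohibitive, if not impossible, to remember the correct usage for every function you use. Luckily, using the help files
+means you don't have to remember that!
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+## Special Operators
+
+To seek help on special operators, use quotes or backticks:
+
+```{r, eval=FALSE}
+?"<-"
+?`<-`
+```
+
+## Getting Help with Packages
+
+Many packages come with "vignettes": tutorials and extended example documentation.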
+Without any arguments, `vignette()` will list all vignettes for all installed packages;
+`vignette(package="package-name")` will list all available vignettes for `package-name`,
+and `vignette("vignette-name")` will open the specified vignette.
+
+If a package doesn't have any vignettes, you can usually still find help in the package's own help pages.
+
+RStudio also has a set of excellent
+[cheatsheets](https://rstudio.com/resources/cheatsheets/) for many packages.
+
+## When You Kind of Remember the Function
+
+If you're not sure what package a function is in or how it's specifically spelled, you can do a fuzzy search:
+
+```{r, eval=FALSE}
+??function_name
+```
+
+A fuzzy search is when you search for an approximate string match. For example, you may remember that the function
+to set your working directory includes "set" in its name. You can do a fuzzy search to help you identify the function:
+
+```{r, eval=FALSE}
+??set
+```
+
+## When You Have No Idea Where to Begin
+
+If you don't know what function or package you need to use,
+[CRAN Task Views](https://cran.at.r-project.org/web/views)
+is a specially maintained list of packages grouped into fields. This can be a good starting point.
+
+## When Your Code Doesn't Work: Seeking Help from Your Peers
+
+If you're having trouble using a function, the answer has most likely already been given on
+[Stack Overflow](https://stackoverflow.com/). You can search using
+the `[r]` tag. Please make sure to see their page on
+[how to ask a good question.](https://stackoverflow.com/help/how-to-ask)
+
+If you can't find the answer, there are a few useful functions to help you ask your peers:
+
+```{r, eval=FALSE}
+?dput
+```
+
+This function dumps the data you're working with into a format that can be copied and pasted by others into their own R session.
+
+```{r}
+sessionInfo()
+```
+
+This function prints out your current version of R, as well as any packages you have loaded. This can be useful for others to help reproduce and debug your issue.
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Challenge 1
+
+Look at the help page for the `c` function. What kind of vector do you expect will be created if you evaluate the following:
+
+```{r, eval=FALSE}
+c(1, 2, 3)
+c('d', 'e', 'f')
+c(1, 2, 'f')
+```
+
+::::::::::::::: solution
+
+## Solution to Challenge 1
+
+The `c()` function creates a vector, in which all elements are of the same type. In the first case, the elements are numeric, in the
+second, they are characters, and in the third they are also characters:
+the numeric values are "coerced" to be characters.
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Challenge 2
+
+Look at the help for the `paste` function. You will need to use it later.
+What's the difference between the `sep` and `collapse` arguments?
+
+::::::::::::::: solution
+
+## Solution to Challenge 2
+
+To look at the help for the `paste()` function, use:
+
+```{r, eval=FALSE}
+help("paste")
+?paste
+```
+
+The difference between `sep` and `collapse` is a little
+tricky. The `paste` function accepts any number of arguments, each of which
+can be a vector of any length. The `sep` argument specifies the string
+used between concatenated terms (by default, a space). The result is a
+vector as long as the longest argument supplied to `paste`. In contrast,
+`collapse` specifies that after concatenation the elements are _collapsed_
+together using the given separator, the result being a single string.
+
+It is important to call the arguments explicitly by typing out the argument name, for example `sep = ","`, so the function understands to use the "," as a separator and not as a term to concatenate.
+e.g.
+
+```{r}
+paste(c("a","b"), "c")
+paste(c("a","b"), "c", ",")
+paste(c("a","b"), "c", sep = ",")
+paste(c("a","b"), "c", collapse = "|")
+paste(c("a","b"), "c", sep = ",", collapse = "|")
+```
+
+(For more information, scroll to the bottom of the `?paste` help page and look at the examples, or try `example('paste')`.)
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Challenge 3
+
+Use help to find a function (and its associated parameters) that you could use to load data from a tabular file in which columns are delimited with `\t` (tab) and the decimal point is a "." (period). This check for decimal
+separator is important, especially if you are working with international
+colleagues, because different countries have different conventions for the
+decimal point (comma vs. period).
+Hint: use `??"read table"` to look up functions related to reading in tabular data.
+
+::::::::::::::: solution
+
+## Solution to Challenge 3
+
+The standard R function for reading tab-delimited files with a period decimal separator is `read.delim()`.
+You can also do this with `read.table(file, sep="\t")` (the period is the default decimal separator for `read.table()`),
+although you may have to change the `comment.char` argument as well if your data file contains hash (#) characters.
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+## Other Resources
+
+- [Quick R](https://www.statmethods.net/)
+- [RStudio cheat sheets](https://www.rstudio.com/resources/cheatsheets/)
+- [Cookbook for R](https://www.cookbook-r.com/)
+
+:::::::::::::::::::::::::::::::::::::::: keypoints
+
+- Use `help()` to get online help in R.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
diff --git a/locale/ja/episodes/04-data-structures-part1.Rmd b/locale/ja/episodes/04-data-structures-part1.Rmd
new file mode 100644
index 000000000..6d2cfd6fc
---
/dev/null
+++ b/locale/ja/episodes/04-data-structures-part1.Rmd
@@ -0,0 +1,1011 @@
+---
+title: Data Structures
+teaching: 40
+exercises: 15
+source: Rmd
+---
+
+::::::::::::::::::::::::::::::::::::::: objectives
+
+- To be able to identify the 5 main data types.
+- To begin exploring data frames, and understand how they are related to vectors and lists.
+- To be able to ask questions from R about the type, class, and structure of an object.
+- To understand the information of the attributes "names", "class", and "dim".
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+:::::::::::::::::::::::::::::::::::::::: questions
+
+- How can I read data in R?
+- What are the basic data types in R?
+- How do I represent categorical information in R?
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+```{r, include=FALSE}
+options(stringsAsFactors = FALSE)
+cats_orig <- data.frame(coat = c("calico", "black", "tabby"), weight = c(2.1, 5, 3.2), likes_catnip = c(1, 0, 1), stringsAsFactors = FALSE)
+cats_bad <- data.frame(coat = c("calico", "black", "tabby", "tabby"), weight = c(2.1, 5, 3.2, "2.3 or 2.4"), likes_catnip = c(1, 0, 1, 1), stringsAsFactors = FALSE)
+cats <- cats_orig
+```
+
+One of R's most powerful features is its ability to deal with tabular data, such as you may already have in a spreadsheet or a CSV file. Let's start by making a toy dataset in your `data/` directory, called `feline-data.csv`:
+
+```{r}
+cats <- data.frame(coat = c("calico", "black", "tabby"),
+                   weight = c(2.1, 5.0, 3.2),
+                   likes_catnip = c(1, 0, 1))
+```
+
+We can now save `cats` as a CSV file. It is good practice to call the argument names explicitly so the function knows what default values you are changing. Here we are setting `row.names = FALSE`. Recall you can use `?write.csv` to pull up the help file to check out the argument names and their default values.
+
+```{r}
+write.csv(x = cats, file = "data/feline-data.csv", row.names = FALSE)
+```
+
+The contents of the new file, `feline-data.csv`:
+
+```{r, eval=FALSE}
+coat,weight,likes_catnip
+calico,2.1,1
+black,5.0,0
+tabby,3.2,1
+```
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+### Tip: Editing Text files in R
+
+Alternatively, you can create `data/feline-data.csv` using a text editor (Nano), or within RStudio with the **File -> New File -> Text File** menu item.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+We can load this into R via the following:
+
+```{r}
+cats <-
read.csv(file = "data/feline-data.csv")
+cats
+```
+
+The `read.table` function is used for reading in tabular data stored in a text file where the columns of data are separated by punctuation characters, such as CSV files (csv = comma-separated values). Tabs and commas are the most common punctuation characters used to separate or delimit data points in csv files.
+For convenience R provides 2 other versions of `read.table`. These are: `read.csv` for files where the data are separated with commas and `read.delim` for files where the data are separated with tabs. Of these three functions `read.csv` is the most commonly used. If needed it is possible to override the default delimiting punctuation marks for both `read.csv` and `read.delim`.
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+### Check your data for factors
+
+In recent times, the default way R handles textual data has changed. Previously, R automatically converted text data into a format called "factors". Now it is read in a simpler format called "character". We will hear about factors later, and what to use them for. For now, remember that in most cases they are not needed and only complicate your life, which is why newer R versions read in text as "character". Check now if your version of R has automatically created factors and convert them to "character" format:
+
+1. Check the data types of your input by typing `str(cats)`.
+2. In the output, look at the three-letter codes after the colons: if you see only `num` and `chr`, you can continue with the lesson and skip this box. If you find `fct`, continue to step 3.
+3. Prevent R from automatically creating "factor" data. That can be done by the following code: `options(stringsAsFactors = FALSE)`. Then, re-read the `cats` table for the change to take effect.
+4. You must set this option every time you restart R. To not forget this, include it in your analysis script before you read in any data, for example in one of the first lines.
+5.
As of R version 4.0.0, text data is no longer converted to factors. You can install this or a newer version to avoid the problem. If you are working on an institute or company computer, ask your administrator to do it.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+We can begin exploring our dataset right away, pulling out columns by specifying them using the `$` operator:
+
+```{r}
+cats$weight
+cats$coat
+```
+
+We can do other operations on the columns:
+
+```{r}
+## Say we discovered that the scale weighs two Kg light:
+cats$weight + 2
+paste("My cat is", cats$coat)
+```
+
+But what about
+
+```{r}
+cats$weight + cats$coat
+```
+
+Understanding what happened here is key to successfully analyzing data in R.
+
+### Data Types
+
+If you guessed that the last command will return an error because `2.1` plus `"black"` is nonsense, you're right, and you already have some intuition for an important concept in programming called _data types_. We can ask what type of data something is:
+
+```{r}
+typeof(cats$weight)
+```
+
+There are 5 main types: `double`, `integer`, `complex`, `logical` and `character`.
+For historic reasons, `double` is also called `numeric`.
+
+```{r}
+typeof(3.14)
+typeof(1L) # The L suffix forces the number to be an integer, since by default R uses float numbers
+typeof(1+1i)
+typeof(TRUE)
+typeof('banana')
+```
+
+No matter how complicated our analyses become, all data in R is interpreted as one of these basic data types. This strictness has some really important consequences.
+
+A user has added details of another cat. This information is in the file `data/feline-data_v2.csv`.
+
+```{r, eval=FALSE}
+file.show("data/feline-data_v2.csv")
+```
+
+```{r, eval=FALSE}
+coat,weight,likes_catnip
+calico,2.1,1
+black,5.0,0
+tabby,3.2,1
+tabby,2.3 or 2.4,1
+```
+
+Load the new cats data like before, and check what type of data we find in the `weight` column:
+
+```{r}
+cats <- read.csv(file="data/feline-data_v2.csv")
+typeof(cats$weight)
+```
+
+Oh no, our weights aren't the double type anymore! If we try to do the same math we did on them before, we run into trouble:
+
+```{r}
+cats$weight + 2
+```
+
+What happened?
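+The `cats` data we are working with is something called a _data frame_. Data frames are one of the most common and versatile types of _data structures_ we will work with in R.
+A given column in a data frame cannot be composed of different data types.
+In this case, R could not read everything in the data frame column `weight` as a _double_, so the data type of the entire column changes to something that is suitable for everything in the column.
+
+When R reads a csv file, it reads it in as a _data frame_. Thus, when we loaded the `cats`
+csv file, it is stored as a data frame. We can recognize data frames by the first row that is written by the `str()` function:
+
+```{r}
+str(cats)
+```
+
+_Data frames_ are composed of rows and columns, where each column has the same number of rows. Different columns in a data frame can be made up of different data types (this is what makes them so versatile), but everything in a given column needs to be the same type (e.g., vector, factor, or list).
+
+Let's explore more about different data structures and how they behave.
+For now, let's remove that extra line from our cats data and reload it:
+
+feline-data.csv:
+
+```
+coat,weight,likes_catnip
+calico,2.1,1
+black,5.0,0
+tabby,3.2,1
+```
+
+And back in RStudio:
+
+```{r, eval=FALSE}
+cats <- read.csv(file="data/feline-data.csv")
+```
+
+```{r, include=FALSE}
+cats <- cats_orig
+```
+
+### Vectors and Type Coercion
+
+To better understand this behavior, let's meet another of the data structures: the _vector_.
+
+```{r}
+my_vector <- vector(length = 3)
+my_vector
+```
+
+A vector in R is essentially an ordered list of things, with the special condition that _everything in the vector must be the same basic data type_. If you don't choose the data type, it'll default to `logical`; or, you can declare an empty vector of whatever type you like.
+
+```{r}
+another_vector <- vector(mode='character', length=3)
+another_vector
+```
+
+You can check if something is a vector:
+
+```{r}
+str(another_vector)
+```
+
+The somewhat cryptic output from this command indicates the basic data type found in this vector (in this case `chr`, character), the number of things in the vector (in this case `[1:3]`), and a few examples of what's actually in the vector (in this case empty character strings). If we similarly do
+
+```{r}
+str(cats$weight)
+```
+
+we see that `cats$weight` is a vector, too: _the columns of data we load into R data.frames are all vectors_, and that's the root of why R forces everything in a column to be the same basic data type.
+
+:::::::::::::::::::::::::::::::::::::: discussion
+
+### Discussion 1
+
+Why is R so opinionated about what we put in our columns of data?
+How does this help us?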
+
+::::::::::::::: solution
+
+### Solution to Discussion 1
+
+By keeping everything in a column the same, we allow ourselves to make simple
+assumptions about our data; if you can interpret one entry in the column as a
+number, then you can interpret _all_ of them as numbers, so we don't have to
+check every time. This consistency is what people mean when they talk about
+_clean data_; in the long run, strict consistency goes a long way to making
+our lives easier in R.
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+#### Coercion by combining vectors
+
+You can also make vectors with explicit contents with the combine function, `c()`:
+
+```{r}
+combine_vector <- c(2,6,3)
+combine_vector
+```
+
+Given what we've learned so far, what do you think the following will produce?
+
+```{r}
+quiz_vector <- c(2,6,'3')
+```
+
+This is something called _type coercion_, and it is the source of many surprises
+and the reason why we need to be aware of the basic data types and how R will
+interpret them. When R encounters a mix of types (here `double` and `character`) to be combined into a single vector, it will force them all to be the same type. Consider:
+
+```{r}
+coercion_vector <- c('a', TRUE)
+coercion_vector
+another_coercion_vector <- c(0, TRUE)
+another_coercion_vector
+```
+
+#### The type hierarchy
+
+The coercion rules go: `logical` -> `integer` -> `double` ("`numeric`") -> `complex` -> `character`, where -> can be read as _are transformed into_. For
+example, combining `logical` and `character` transforms the result to
+`character`:
+
+```{r}
+c('a', TRUE)
+```
+
+A quick way to recognize `character` vectors is by the quotes that enclose them when they are printed.
+
+You can try to force coercion against this flow using the `as.` functions:
+
+```{r}
+character_vector_example <- c('0','2','4')
+character_vector_example
+character_coerced_to_double <- as.double(character_vector_example)
+character_coerced_to_double
+double_coerced_to_logical <- as.logical(character_coerced_to_double)
+double_coerced_to_logical
+```
+
+As you can see, some surprising things can happen when R forces one basic data type into another! Nitty-gritty of type coercion aside, the point is: if your data doesn't look like what you thought it was going to look like, type coercion may well be to blame; make sure everything is the same type in your vectors and your columns of data.frames, or you will get nasty surprises!
+
+But coercion can also be very useful! For example, in our `cats` data
+`likes_catnip` is numeric, but we know that the 1s and 0s actually represent
+`TRUE` and `FALSE` (a common way of representing them). We should use the
+`logical` datatype here, which has two states: `TRUE` or `FALSE`, which is
+exactly what our data represents. We can 'coerce' this column to be `logical` by
+using the `as.logical` function:
+
+```{r}
+cats$likes_catnip
+cats$likes_catnip <- as.logical(cats$likes_catnip)
+cats$likes_catnip
+```
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+### Challenge 1
+
+An important part of every data analysis is cleaning the input data. If you know that the input data is all of the same format (e.g. numbers), your analysis is much easier! Clean the cat data set from the chapter about
+type coercion.
+
+#### Copy the code template
+
+Create a new script in RStudio and copy and paste the following code. Then
+move on to the tasks below, which help you to fill in the gaps (\_\_\_\_\_\_).
+
+```
+# Read data
+cats <- read.csv("data/feline-data_v2.csv")
+
+# 1. Print the data
+_____
+
+# 2. Show an overview of the table with all data types
+_____(cats)
+
+# 3. The "weight" column has the incorrect data type __________.
+#    The correct data type is: ____________.
+
+# 4. Correct the 4th weight data point with the mean of the two given values
+cats$weight[4] <- 2.35
+#    print the data again to see the effect
+cats
+
+# 5. Convert the weight to the right data type
+cats$weight <- ______________(cats$weight)
+
+#    Calculate the mean to test yourself
+mean(cats$weight)
+
+# If you see the correct mean value (and not NA), you did the exercise correctly!
+```
+
+### Instructions for the tasks
+
+#### 1\.
Print the data
+
+Execute the first statement (`read.csv(...)`). Then print the data to the
+console.
+
+::::::::::::::: solution
+
+### Tip 1.1
+
+Show the content of any variable by typing its name.
+
+### Solution to Challenge 1.1
+
+Two correct solutions:
+
+```
+cats
+print(cats)
+```
+
+:::::::::::::::::::::::::
+
+#### 2\. Overview of the data types
+
+The data type of your data is as important as the data itself. Use a
+function we saw earlier to print out the data types of all columns of the
+`cats` table.
+
+::::::::::::::: solution
+
+### Tip 1.2
+
+In the chapter "Data types" we saw two functions that can show data types.
+One printed just a single word, the data type name. The other printed
+a short form of the data type, and the first few values. We need the second
+here.
+
+:::::::::::::::::::::::::
+
+> ### Solution to Challenge 1.2
+>
+> ```
+> str(cats)
+> ```
+
+#### 3\. Which data type do we need?
+
+The shown data type is not the right one for this data (weight of a cat). Which data type do we need?
+
+- Why did the `read.csv()` function not choose the correct data type?
+- Fill in the gap in the comment with the correct data type for cat weight!
+
+::::::::::::::: solution
+
+### Tip 1.3
+
+Scroll up to the section about the [type hierarchy](#the-type-hierarchy) to review the available data types.
+
+:::::::::::::::::::::::::
+
+::::::::::::::: solution
+
+### Solution to Challenge 1.3
+
+- Weight is expressed on a continuous scale (real numbers). The R data type for this is "double" (also known as "numeric").
+- The fourth row has the value "2.3 or 2.4". That is not a number
+  but two, and an English word. Therefore, the "character" data type
+  is chosen. The whole column is now text, because all values in the same
+  column have to be the same data type.
+
+:::::::::::::::::::::::::
+
+#### 4\.
Correct the problematic value
+
+The code to assign a new weight value to the problematic fourth row is given.
+Think first and then execute it: What will be the data type after assigning
+a number like in this example?
+You can check the data type after executing to see if you were right.
+
+::::::::::::::: solution
+
+### Tip 1.4
+
+Revisit the hierarchy of data types when two different data types are combined.
+
+:::::::::::::::::::::::::
+
+> ### Solution to Challenge 1.4
+>
+> The data type of the column "weight" is "character". The assigned data type is "double". Combining two data types yields the data type that is
+> higher in the following hierarchy:
+>
+> ```
+> logical < integer < double < complex < character
+> ```
+>
+> Therefore, the column is still of type "character"! We need to manually
+> convert it to "double".
+
+#### 5\. Convert the column "weight" to the correct data type
+
+Cat weights are numbers. But the column does not have this data type yet.
+Coerce the column to floating point numbers.
+
+::::::::::::::: solution
+
+### Tip 1.5
+
+The functions to convert data types start with `as.`. You can look
+for the function further up in the manuscript or use the RStudio
+auto-complete function: Type "`as.`" and then press the TAB key.
+
+:::::::::::::::::::::::::
+
+> ### Solution to Challenge 1.5
+>
+> There are two functions that are synonymous for historic reasons:
+>
+> ```
+> cats$weight <- as.double(cats$weight)
+> cats$weight <- as.numeric(cats$weight)
+> ```
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+### Some basic vector functions
+
+The combine function, `c()`, will also append things to an existing vector:
+
+```{r}
+ab_vector <- c('a', 'b')
+ab_vector
+combine_example <- c(ab_vector, 'SWC')
+combine_example
+```
+
+You can also make series of numbers:
+
+```{r}
+mySeries <- 1:10
+mySeries
+seq(10)
+seq(1, 10, by=0.1)
+```
+
+We can ask a few questions about vectors:
+
+```{r}
+sequence_example <- 20:25
+head(sequence_example, n=2)
+tail(sequence_example, n=4)
+length(sequence_example)
+typeof(sequence_example)
+```
+
+We can get individual elements of a vector by using the bracket notation:
+
+```{r}
+first_element <- sequence_example[1]
+first_element
+```
+
+To change a single element, use the bracket on the other side of the arrow:
+
+```{r}
+sequence_example[1] <- 30
+sequence_example
+```
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+### Challenge 2
+
+Start by making a vector with the numbers 1 through 26. Then, multiply the vector by 2.
+
+::::::::::::::: solution
+
+### Solution to Challenge 2
+
+```{r}
+x <- 1:26
+x <- x * 2
+```
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+### Lists
+
+Another data structure you'll want in your bag of tricks is the `list`. A list is simpler in some ways than the other types, because you can put anything you want in it. Remember that _everything in a vector must be of the same basic data type_, but a list can have different data types:
+
+```{r}
+list_example <- list(1, "a", TRUE, 1+4i)
+list_example
+```
+
+When printing the object structure with `str()`, we see the data types of all elements:
+
+```{r}
+str(list_example)
+```
+
+What is the use of lists? They can **organize data of different types**. For
+example, you can organize different tables that belong together, similar to
+spreadsheets in Excel. But there are many other uses, too.
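
As a rough illustration of that point, the sketch below groups two small tables and a note in one list. The object names (`weights`, `coats`, `study`, `element_classes`) and the values are made up for this example and are not part of the lesson's data.

```r
# Two small, made-up tables that belong together
weights <- data.frame(id = 1:3, weight = c(2.1, 5.0, 3.2))
coats <- data.frame(id = 1:3, coat = c("calico", "black", "tabby"))

# A list can hold both tables plus a plain character note
study <- list(weights, coats, "measured in kg")

# Each element keeps its own class inside the list
element_classes <- sapply(study, class)
element_classes

# str() gives a compact overview of the top-level elements
str(study, max.level = 1)
```

A vector could not hold these three objects without coercing them; a list keeps each one intact.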
+
+We will see another example that will maybe surprise you in the next chapter.
+
+To retrieve one of the elements of a list, use the **double bracket**:
+
+```{r}
+list_example[[2]]
+```
+
+The elements of lists can also have **names**. They are given by prepending them to the values, separated by an equals sign:
+
+```{r}
+another_list <- list(title = "Numbers", numbers = 1:10, data = TRUE )
+another_list
+```
+
+This results in a **named list**. Now we have a new function of our object!
+We can access single elements by an additional way!
+
+```{r}
+another_list$title
+```
+
+## Names
+
+With names, we can give meaning to elements. It is the first time that we do not
+only have the **data**, but also explaining information. It is _metadata_
+that can be stuck to the object like a label. In R, this is called an
+**attribute**. Some attributes enable us to do more with our
+object, for example, like here, accessing an element by a self-defined name.
+
+### Accessing vectors and lists by name
+
+We have already seen how to generate a named list. The way to generate a named vector is very similar. You have seen this function before:
+
+```{r}
+pizza_price <- c( pizzasubito = 5.64, pizzafresh = 6.60, callapizza = 4.50 )
+```
+
+The way to retrieve elements is different, though:
+
+```{r}
+pizza_price["pizzasubito"]
+```
+
+The approach used for the list does not work:
+
+```{r}
+pizza_price$pizzafresh
+```
+
+It will pay off if you remember this error message; you will meet it in your own
+analyses. It means that you have just tried accessing an element like it was in a list, but it is actually in a vector.
+
+### Accessing and changing names
+
+If you are only interested in the names, use the `names()` function:
+
+```{r}
+names(pizza_price)
+```
+
+We have seen how to access and change single elements of a vector. The same is
+possible for names:
+
+```{r}
+names(pizza_price)[3]
+names(pizza_price)[3] <- "call-a-pizza"
+pizza_price
+```
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+### Challenge 3
+
+- What is the data type of the names of `pizza_price`? You can find out
+  using the `str()` or `typeof()` functions.
+
+::::::::::::::: solution
+
+### Solution to Challenge 3
+
+You get the names of an object by wrapping the object name inside `names(...)`. Similarly, you get the data type of the names by again wrapping the whole code in `typeof(...)`:
+
+```
+typeof(names(pizza_price))
+```
+
+Alternatively, use a new variable if this is easier for you to read:
+
+```
+n <- names(pizza_price)
+typeof(n)
+```
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+### Challenge 4
+
+Instead of just changing some of the names a vector/list already has, you can also set all names of an object by writing code like this (replace ALL CAPS text):
+
+```
+names( OBJECT ) <- CHARACTER_VECTOR
+```
+
+Create a vector that gives the number for each letter in the alphabet!
+
+1. Generate a vector called `letter_no` with the sequence of numbers from 1 to 26!
+2. R has a built-in object called `LETTERS`. It is a 26-character vector, from A to Z. Set the names of the number sequence to these 26 letters.
+3. Test yourself by calling `letter_no["B"]`, which should give you the number 2!
+
+::::::::::::::: solution
+
+### Solution to Challenge 4
+
+```
+letter_no <- 1:26 # or seq(1,26)
+names(letter_no) <- LETTERS
+letter_no["B"]
+```
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+## Data frames
+
+We met data frames at the very beginning of this lesson; they represent
+a table of data. We didn't go much further into detail with our example cat data frame:
+
+```{r}
+cats
+```
+
+We can now understand something a bit surprising in our data.frame; what happens if we run:
+
+```{r}
+typeof(cats)
+```
+
+We see that data.frames look like lists 'under the hood'. Think again what we
+heard about what lists can be used for:
+
+> Lists organize data of different types
+
+Columns of a data frame are vectors of different types, that are organized by belonging to the same table.
+
+A data.frame is really a list of vectors. It is a special list in which all the vectors must have the same length.
+
+But how is this "speciality" written into the object, so that R does not treat it like any other list, but as a table?
+
+```{r}
+class(cats)
+```
+
+A **class**, just like names, is an attribute attached to the object. It tells
+us what this object means for humans.
+
+You might wonder: Why do we need another what-type-of-object-is-this-function?
+We already have `typeof()`? That function tells us how the object is
+**constructed in the computer**. The `class` is the **meaning of the object for humans**. Consequently, what `typeof()` returns is _fixed_ in R (mainly the five data types), whereas the output of `class()` is _diverse_ and _extendable_ by R packages.
+
+In our `cats` example, we have an integer, a double and a logical variable. As we have seen already, each column of a data.frame is a vector:
+
+```{r}
+cats$coat
+cats[,1]
+typeof(cats[,1])
+str(cats[,1])
+```
+
+Each row is an _observation_ of different variables, itself a data.frame, and thus can be composed of elements of different types:
+
+```{r}
+cats[1,]
+typeof(cats[1,])
+str(cats[1,])
+```
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+### Challenge 5
+
+There are several subtly different ways to call variables, observations and elements from data.frames:
+
+- `cats[1]`
+- `cats[[1]]`
+- `cats$coat`
+- `cats["coat"]`
+- `cats[1, 1]`
+- `cats[, 1]`
+- `cats[1, ]`
+
+Try out these examples and explain what is returned by each one.
+
+_Hint:_ Use the function `typeof()` to examine what is returned in each case.
+
+::::::::::::::: solution
+
+### Solution to Challenge 5
+
+```{r, eval=TRUE, echo=TRUE}
+cats[1]
+```
+
+We can think of a data frame as a list of vectors. The single bracket `[1]`
+returns the first slice of the list, as another list. In this case it is the
+first column of the data frame.
+
+```{r, eval=TRUE, echo=TRUE}
+cats[[1]]
+```
+
+The double bracket `[[1]]` returns the contents of the list item. In this case
+it is the contents of the first column, a _vector_ of type _character_.
+
+```{r, eval=TRUE, echo=TRUE}
+cats$coat
+```
+
+This example uses the `$` character to address items by name.
`coat` is the first column of the data frame, again a _vector_ of type _character_. + +```{r, eval=TRUE, echo=TRUE} +cats["coat"] +``` + +Here we use a single bracket `["coat"]`, replacing the index number with the column name. As in example 1, the returned object is a _list_. + +```{r, eval=TRUE, echo=TRUE} +cats[1, 1] +``` + +This example uses a single bracket, but this time we provide row and column coordinates. The returned object is the value in row 1, column 1. The object is a _vector_ of type _character_. + +```{r, eval=TRUE, echo=TRUE} +cats[, 1] +``` + +As in the previous example, we use single brackets and provide row and column coordinates, but the row coordinate is not specified. R interprets this missing value as all the elements in this _column_ and returns them as a _vector_. + +```{r, eval=TRUE, echo=TRUE} +cats[1, ] +``` + +Again we use the single bracket with row and column coordinates. The column coordinate is not specified. The return value is a _list_ containing all the values in the first row. + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::::: callout + +### Tip: Renaming data frame columns + +Data frames have column names, which can be accessed with the `names()` function: + +```{r} +names(cats) +``` + +If you want to rename the second column of `cats`, you can assign a new name to the second element of `names(cats)`: + +```{r} +names(cats)[2] <- "weight_kg" +cats +``` + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +```{r, include=FALSE} +# reverting cats back to its original version +cats <- cats_orig +``` + +### Matrices + +Last but not least is the matrix. We can declare a matrix full of zeros: + +```{r} +matrix_example <- matrix(0, ncol=6, nrow=3) +matrix_example +``` + +What makes a matrix special is the `dim()` attribute: + +```{r} +dim(matrix_example) +``` + +As with other data structures, we can ask things about our matrix: + +```{r} +typeof(matrix_example) +class(matrix_example) +str(matrix_example) +nrow(matrix_example) +ncol(matrix_example) +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +### Challenge 6 + +What do you think will be the result of +`length(matrix_example)`? +Try it. +Were you right? Why / why not? 
+ +::::::::::::::: solution + +### Solution to Challenge 6 + +What do you think will be the result of +`length(matrix_example)`? + +```{r} +matrix_example <- matrix(0, ncol=6, nrow=3) +length(matrix_example) +``` + +Because a matrix is a vector with added dimension attributes, `length` gives you the total number of elements in the matrix. + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +### Challenge 7 + +Make another matrix, this time containing the numbers 1:50, +with 5 columns and 10 rows. +Did the `matrix` function fill your matrix by column, or by row, as its default behaviour? +See if you can figure out how to change this. +(hint: read the documentation for `matrix`!) + +::::::::::::::: solution + +### Solution to Challenge 7 + +Make another matrix, this time containing the numbers 1:50, +with 5 columns and 10 rows. +Did the `matrix` function fill your matrix by column, or by row, as its default behaviour? +See if you can figure out how to change this. +(hint: read the documentation for `matrix`!) + +```{r, eval=FALSE} +x <- matrix(1:50, ncol=5, nrow=10) +x <- matrix(1:50, ncol=5, nrow=10, byrow = TRUE) # to fill by row +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +### Challenge 8 + +Create a list of length two containing a character vector for each of the sections in this part of the workshop: + +- Data types +- Data structures + +Populate each character vector with the names of the data types and data structures we've seen so far. + +::::::::::::::: solution + +### Solution to Challenge 8 + +```{r} +dataTypes <- c('double', 'complex', 'integer', 'character', 'logical') +dataStructures <- c('data.frame', 'vector', 'list', 'matrix') +answer <- list(dataTypes, dataStructures) +``` + +Note: it's nice to make a list in big writing on the board or taped to the wall +listing all of these types and structures - leave it up for the rest of the workshop +to remind people of the importance of these basics. 
+ +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +### Challenge 9 + +Consider the R output of the matrix below: + +```{r, echo=FALSE} +matrix(c(4, 1, 9, 5, 10, 7), ncol = 2, byrow = TRUE) +``` + +What was the correct command used to write this matrix? Examine each command and try to figure out the correct one before typing them. +Think about what matrices the other commands will produce. + +1. `matrix(c(4, 1, 9, 5, 10, 7), nrow = 3)` +2. `matrix(c(4, 9, 10, 1, 5, 7), ncol = 2, byrow = TRUE)` +3. `matrix(c(4, 9, 10, 1, 5, 7), nrow = 2)` +4. `matrix(c(4, 1, 9, 5, 10, 7), ncol = 2, byrow = TRUE)` + +::::::::::::::: solution + +### Solution to Challenge 9 + +Consider the R output of the matrix below: + +```{r, echo=FALSE} +matrix(c(4, 1, 9, 5, 10, 7), ncol = 2, byrow = TRUE) +``` + +What was the correct command used to write this matrix? Examine each command and try to figure out the correct one before typing them. +Think about what matrices the other commands will produce. + +```{r, eval=FALSE} +matrix(c(4, 1, 9, 5, 10, 7), ncol = 2, byrow = TRUE) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: keypoints + +- Use `read.csv` to read tabular data in R. +- The basic data types in R are double, integer, complex, logical, and character. +- Data structures such as data frames or matrices are built on top of lists and vectors, with some added attributes. + +:::::::::::::::::::::::::::::::::::::::::::::::::: diff --git a/locale/ja/episodes/05-data-structures-part2.Rmd b/locale/ja/episodes/05-data-structures-part2.Rmd new file mode 100644 index 000000000..78f89a3c8 --- /dev/null +++ b/locale/ja/episodes/05-data-structures-part2.Rmd @@ -0,0 +1,362 @@ +--- +title: Working with Data Frames +teaching: 20 +exercises: 10 +source: Rmd +--- + +::::::::::::::::::::::::::::::::::::::: objectives + +- Add and remove rows or columns. +- Append two data frames. +- Display basic properties of data frames including size and class of the columns, names, and first few rows. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- 
How can I manipulate a data frame? + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +```{r, include=FALSE} +``` + +At this point, you've seen the basic data types and data structures in R. Everything you do from here on will be a manipulation of those tools. Most of the time, though, the star of the show is the data frame: the table that we created by loading information from a CSV file. In this lesson, we'll learn a few more things about working with data frames. + +## Adding columns and rows in data frames + +We already learned that the columns of a data frame are vectors, so our data are consistent in type throughout each column. As such, if we want to add a new column, we can start by making a new vector: + +```{r, echo=FALSE} +cats <- read.csv("data/feline-data.csv") +``` + +```{r} +age <- c(2, 3, 5) +cats +``` + +We can then add this as a column via: + +```{r} +cbind(cats, age) +``` + +Note that if we tried to add a vector of ages with a different number of entries than the number of rows in the data frame, it would fail: + +```{r, error=TRUE} +age <- c(2, 3, 5, 12) +cbind(cats, age) + +age <- c(2, 3) +cbind(cats, age) +``` + +Why didn't this work? R wants to see one element in our new column for every row in the table: + +```{r} +nrow(cats) +length(age) +``` + +So for it to work we need `nrow(cats)` to equal `length(age)`. Let's overwrite the content of `cats` with our new data frame. + +```{r} +age <- c(2, 3, 5) +cats <- cbind(cats, age) +``` + +Now how about adding rows? We already know that the rows of a data frame are lists: + +```{r} +newRow <- list("tortoiseshell", 3.3, TRUE, 9) +cats <- rbind(cats, newRow) +``` + +Let's confirm that our new row was added correctly. + +```{r} +cats +``` + +## Removing rows + +We now know how to add rows and columns to our data frame in R. Now let's learn to remove rows. + +```{r} +cats +``` + +We can ask for a data frame minus the last row: + +```{r} +cats[-4, ] +``` + +Notice the comma with nothing after it, which indicates that we want to drop the entire fourth row. + +Note: we could also remove several rows at once by putting the row numbers inside a vector, for example: `cats[c(-3,-4), ]` + +## Removing columns + +We can also remove columns from our data frame. What if we want to remove the column "age"? We can remove it in two ways, by variable number or by index. 
+ +```{r} +cats[,-4] +``` + +Notice the comma with nothing before it, indicating that we want to keep all of the rows. + +Alternatively, we can drop the column by using the index name and the `%in%` operator. The `%in%` operator goes through each element of its left argument, in this case the names of `cats`, and asks, "Does this element occur in the second argument?" + +```{r} +drop <- names(cats) %in% c("age") +cats[,!drop] +``` + +We will cover subsetting with logical operators like `%in%` in more detail in the next episode. See the section [Subsetting through other logical operations](06-data-subsetting.Rmd). + +## Appending to a data frame + +The key thing to remember when adding data to a data frame is that _columns are vectors and rows are lists_. We can also glue two data frames together with `rbind`: + +```{r} +cats <- rbind(cats, cats) +cats +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 1 + +You can create a new data frame right from within R with the following syntax: + +```{r} +df <- data.frame(id = c("a", "b", "c"), + x = 1:3, + y = c(TRUE, TRUE, FALSE)) +``` + +Make a data frame that holds the following information for yourself: + +- first name +- last name +- lucky number + +Then use `rbind` to add an entry for the people sitting beside you. +Finally, use `cbind` to add a column with each person's answer to the question, "Is it time for coffee break?" + +::::::::::::::: solution + +## Solution to Challenge 1 + +```{r} +df <- data.frame(first = c("Grace"), + last = c("Hopper"), + lucky_number = c(0)) +df <- rbind(df, list("Marie", "Curie", 238) ) +df <- cbind(df, coffeetime = c(TRUE, TRUE)) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Realistic example + +So far, you have seen the basics of manipulating data frames with our cat data; now let's use those skills to digest a more realistic dataset. Let's read in the `gapminder` dataset that we downloaded previously: + +```{r} +gapminder <- read.csv("data/gapminder_data.csv") +``` + +::::::::::::::::::::::::::::::::::::::::: callout + +## Miscellaneous Tips + +- Another type of file you might encounter are tab-separated value files (.tsv). 
To specify a tab as a separator, use `"\\t"` or `read.delim()`. + +- Files can also be downloaded directly from the Internet into a local folder of your choice on your computer using the `download.file` function. The `read.csv` function can then be used to read the downloaded file from its download location, for example: + +```{r, eval=FALSE, echo=TRUE} +download.file("https://raw.githubusercontent.com/swcarpentry/r-novice-gapminder/main/episodes/data/gapminder_data.csv", destfile = "data/gapminder_data.csv") +gapminder <- read.csv("data/gapminder_data.csv") +``` + +- Alternatively, you can read files directly into R from the Internet by replacing the file path with a web address in `read.csv`. Note that in doing this no local copy of the csv file is first saved onto your computer. For example: + +```{r, eval=FALSE, echo=TRUE} +gapminder <- read.csv("https://raw.githubusercontent.com/swcarpentry/r-novice-gapminder/main/episodes/data/gapminder_data.csv") +``` + +- You can read directly from Excel spreadsheets without converting them to plain text first by using the [readxl](https://cran.r-project.org/package=readxl) package. + +- The argument "stringsAsFactors" can be useful to tell R how to read strings, either as factors or as character strings. In R 4.0 and later, all strings are read in as characters by default, but in earlier versions of R, strings are read in as factors by default. For more information, see the call-out in [the previous episode](https://swcarpentry.github.io/r-novice-gapminder/04-data-structures-part1.html#check-your-data-for-factors). + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +Let's investigate `gapminder` a bit; the first thing we should always do is check out what the data looks like with `str`: + +```{r} +str(gapminder) +``` + +An additional method for examining the structure of gapminder is to use the `summary` function. This function can be used on various objects in R. For data frames, `summary` yields a numeric, tabular, or descriptive summary of each column. 
Numeric or integer columns are described by descriptive statistics (quartiles and mean), and character columns by their length, class, and mode. + +```{r} +summary(gapminder) +``` + +Along with the `str` and `summary` functions, we can examine individual columns of the data frame with our `typeof` function: + +```{r} +typeof(gapminder$year) +typeof(gapminder$country) +str(gapminder$country) +``` + +We can also interrogate the data frame for information about its dimensions. +Remembering that `str(gapminder)` said there were 1704 observations of 6 variables in `gapminder`, what do you think the following will produce, and why? + +```{r} +length(gapminder) +``` + +A fair guess would have been that the length of a data frame is the number of rows it has (1704), but this is not the case. Remember, a data frame is a _list of vectors and factors_: + +```{r} +typeof(gapminder) +``` + +When `length` gave us 6, it's because `gapminder` is built out of a list of 6 columns. To get the number of rows and columns in our dataset, try: + +```{r} +nrow(gapminder) +ncol(gapminder) +``` + +Or, both at once: + +```{r} +dim(gapminder) +``` + +We'll also likely want to know what the titles of all the columns are, so we can ask for them later: + +```{r} +colnames(gapminder) +``` + +At this stage, it's important to ask ourselves whether the structure R is reporting matches our intuition or expectations. Do the basic data types reported for each column make sense? If not, we need to sort out any problems now, before they turn into bad surprises down the road, using what we've learned about how R interprets data and the importance of _strict consistency_ in how we record our data. + +Once we're happy that the data types and structures seem reasonable, it's time to start digging into our data proper. Check out the first few lines: + +```{r} +head(gapminder) +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 2 + +It's good practice to also check the last few lines of your data and some in the middle. How would you do this? + +Searching for rows specifically in the middle isn't too hard, but we could ask for a few lines at random. How would you code this? + +::::::::::::::: solution + +## Solution to Challenge 2 + +To check the last few lines it's relatively simple, as R already has a function for this: + +```r +tail(gapminder) +tail(gapminder, n = 15) +``` + +What about a few arbitrary rows just in case something is odd in the middle? + +## Tip: There are several ways to achieve this + +The solution here presents one form of using nested functions, i.e. a function passed as an argument to another function. This might sound like a new concept, but you are already using it! 
+Remember my\_dataframe[rows, cols] will print to screen your data frame with the number of rows and columns you asked for (although you might have asked for a range or named columns for example). How would you get the last row if you don't know how many rows your data frame has? R has a function for this. What about getting a (pseudorandom) sample? R also has a function for this. + +```r +gapminder[sample(nrow(gapminder), 5), ] +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +To make sure our analysis is reproducible, we should put the code into a script file so we can come back to it later. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 3 + +Go to File -> New File -> R Script, and write an R script to load in the `gapminder` dataset. Put it in the `scripts/` +directory and add it to version control. + +Run the script using the `source` function, using the file path as its argument (or by pressing the "Source" button in RStudio). + +::::::::::::::: solution + +## Solution to Challenge 3 + +The `source` function can be used to use a script within a script. +Assume you would like to load the same type of file over and over +again and therefore you need to specify the arguments to fit the +needs of your file. Instead of writing the necessary argument again +and again you could just write it once and save it as a script, then reuse that script whenever you need it. +Check out `?source` to find out more. 
+ +```{r, eval=FALSE} +download.file("https://raw.githubusercontent.com/swcarpentry/r-novice-gapminder/main/episodes/data/gapminder_data.csv", destfile = "data/gapminder_data.csv") +gapminder <- read.csv(file = "data/gapminder_data.csv") +``` + +To run the script and load the data into the `gapminder` variable: + +```{r, eval=FALSE} +source(file = "scripts/load-gapminder.R") +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 4 + +Read the output of `str(gapminder)` again; this time, use what you've learned about factors, lists, and vectors, as well as the output of functions like `colnames` and `dim`, to explain what everything that `str` prints out for `gapminder` means. If there are any parts you can't interpret, discuss them with your neighbors! + +::::::::::::::: solution + +## Solution to Challenge 4 + +The object `gapminder` is a data frame with the following columns: + +- `country` and `continent` are character strings. +- `year` is an integer vector. +- `pop`, `lifeExp`, and `gdpPercap` are numeric vectors. + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: keypoints + +- Use `cbind()` to add a new column to a data frame. +- Use `rbind()` to add a new row to a data frame. +- Remove rows from a data frame. +- Use `str()`, `summary()`, `nrow()`, `ncol()`, `dim()`, `colnames()`, `head()`, and `typeof()` to understand the structure of a data frame. +- Read in a csv file using `read.csv()`. +- Understand what `length()` of a data frame represents. + +:::::::::::::::::::::::::::::::::::::::::::::::::: diff --git a/locale/ja/episodes/06-data-subsetting.Rmd b/locale/ja/episodes/06-data-subsetting.Rmd new file mode 100644 index 000000000..9611f38c3 --- /dev/null +++ b/locale/ja/episodes/06-data-subsetting.Rmd @@ -0,0 +1,820 @@ +--- +title: Subsetting Data +teaching: 35 +exercises: 15 +source: Rmd +--- + +::::::::::::::::::::::::::::::::::::::: objectives + +- To be able to subset vectors, factors, matrices, lists, and data frames +- To be able to extract individual and multiple elements: by index, by name, using comparison operations +- To be able to skip and remove elements from various data structures. 
+ +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- How can I work with subsets of data in R? + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +```{r, include=FALSE} +gapminder <- read.csv("data/gapminder_data.csv", header = TRUE) +``` + +R has many powerful subset operators. Mastering them will allow you to +easily perform complex operations on any kind of dataset. + +There are six different ways we can subset any kind of object, and three different subsetting operators for the different data structures. + +Let's start with the workhorse of R: a simple numeric vector. + +```{r} +x <- c(5.4, 6.2, 7.1, 4.8, 7.5) +names(x) <- c('a', 'b', 'c', 'd', 'e') +x +``` + +::::::::::::::::::::::::::::::::::::::::: callout + +## Atomic vectors + +In R, simple vectors containing character strings, numbers, or logical values are called _atomic_ vectors because they can't be further simplified. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +So now that we've created a dummy vector to play with, how do we get at its contents? + +## Accessing elements using their indices + +To extract elements of a vector we can give their corresponding index, starting from one: + +```{r} +x[1] +``` + +```{r} +x[4] +``` + +It may look different, but the square brackets operator is a function. For vectors +(and matrices), it means "get me the nth element". + +We can ask for multiple elements at once: + +```{r} +x[c(1, 3)] +``` + +Or slices of the vector: + +```{r} +x[1:4] +``` + +The `:` operator creates a sequence of numbers from the left element to the right. + +```{r} +1:4 +c(1, 2, 3, 4) +``` + +We can ask for the same element multiple times: + +```{r} +x[c(1,1,3)] +``` + +If we ask for an index beyond the length of the vector, R will return a missing value: + +```{r} +x[6] +``` + +This is a vector of length one containing an `NA`, whose name is also `NA`. + +If we ask for the 0th element, we get an empty vector: + +```{r} +x[0] +``` + +::::::::::::::::::::::::::::::::::::::::: callout + +## Vector numbering in R starts at 1 + +In many programming languages (C and Python, for example), the first element of a vector has an index of 0. In R, the first element is 1. 
+ +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Skipping and removing elements + +If we use a negative number as the index of a vector, R will return every element _except_ for the one specified: + +```{r} +x[-2] +``` + +We can skip multiple elements: + +```{r} +x[c(-1, -5)] # or x[-c(1,5)] +``` + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: Order of operations + +A common trip-up for novices occurs when trying to skip slices of a vector. It's natural to try to negate a +sequence like so: + +```{r, error=TRUE, eval=FALSE} +x[-1:3] +``` + +This gives a somewhat cryptic error: + +```{r, error=TRUE, echo=FALSE} +x[-1:3] +``` + +But remember the order of operations. `:` is really a function. It takes its first argument as -1 and its second as 3, so it generates the sequence of numbers `c(-1, 0, 1, 2, 3)`. + +The correct solution is to wrap that function call in brackets, so that the `-` operator applies to the result: + +```{r} +x[-(1:3)] +``` + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +To remove elements from a vector, we need to assign the result back into the variable: + +```{r} +x <- x[-4] +x +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 1 + +Given the following code: + +```{r} +x <- c(5.4, 6.2, 7.1, 4.8, 7.5) +names(x) <- c('a', 'b', 'c', 'd', 'e') +print(x) +``` + +Come up with at least 2 different commands that will produce the following output: + +```{r, echo=FALSE} +x[2:4] +``` + +After you find 2 different commands, compare notes with your neighbour. Did you have different strategies? 
+ +::::::::::::::: solution + +## Solution to Challenge 1 + +```{r} +x[2:4] +``` + +```{r} +x[-c(1,5)] +``` + +```{r} +x[c(2,3,4)] +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Subsetting by name + +We can extract elements by using their name, instead of extracting by index: + +```{r} +x <- c(a=5.4, b=6.2, c=7.1, d=4.8, e=7.5) # we can name a vector 'on the fly' +x[c("a", "c")] +``` + +This is usually a much more reliable way to subset objects: the position of various elements can often change when chaining together subsetting operations, but the names will always remain the same! + +## Subsetting through other logical operations {#logical-operations} + +We can also use any logical vector to subset: + +```{r} +x[c(FALSE, FALSE, TRUE, FALSE, TRUE)] +``` + +Since comparison operators (e.g. `>`, `<`, `==`) evaluate to logical vectors, we can also use them to succinctly subset vectors: the following statement gives the same result as the previous one. + +```{r} +x[x > 7] +``` + +Breaking it down, this statement first evaluates `x > 7`, generating a logical vector `c(FALSE, FALSE, TRUE, FALSE, TRUE)`, and then selects the elements of `x` corresponding to the `TRUE` values. + +We can use `==` to mimic the previous method of indexing by name (remember you have to use `==` rather than `=` for comparisons): + +```{r} +x[names(x) == "a"] +``` + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: Combining logical conditions + +We often want to combine multiple logical +criteria. For example, we might want to find all the countries that are +located in Asia **or** Europe **and** have life expectancies within a certain +range. Several operations for combining logical vectors exist in R: + +- `&`, the "logical AND" operator: returns `TRUE` if both the left and right + are `TRUE`. +- `|`, the "logical OR" operator: returns `TRUE`, if either the left or right + (or both) are `TRUE`. + +You may sometimes see `&&` and `||` instead of `&` and `|`. These two-character operators +work on single values only: in older versions of R they looked at just the first +element of each vector and ignored the rest, and from R 4.3.0 they raise an error when given +a vector longer than one. In general you should not use the two-character +operators in data analysis; save them +for programming, i.e. deciding whether to execute a statement. + +- `!`, the "logical NOT" operator: converts `TRUE` to `FALSE` and `FALSE` to + `TRUE`. It can negate a single logical condition (e.g. `!TRUE` becomes + `FALSE`), or a whole vector of conditions (e.g. `!c(TRUE, FALSE)` becomes + `c(FALSE, TRUE)`). 
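
The three operators above can be tried out on a small vector. The sketch below is only an illustration: the `ages` vector and the `working_age` name are made up for this example and are not part of the lesson data.

```r
# Hypothetical ages, used only for illustration
ages <- c(12, 25, 37, 64, 81)

# & keeps elements where both conditions hold
working_age <- ages >= 18 & ages < 65
ages[working_age]

# | keeps elements where at least one condition holds
ages[ages < 18 | ages >= 65]

# ! negates the whole logical vector, so this selects
# the same elements as the previous line
ages[!working_age]
```

Storing the logical vector in a variable first, as with `working_age`, makes it easy to reuse or negate the same condition without retyping it.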
+ +Additionally, you can compare the elements within a single vector using the +`all` function (which returns `TRUE` if every element of the vector is `TRUE`) +and the `any` function (which returns `TRUE` if one or more elements of the +vector are `TRUE`). + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 2 + +Given the following code: + +```{r} +x <- c(5.4, 6.2, 7.1, 4.8, 7.5) +names(x) <- c('a', 'b', 'c', 'd', 'e') +print(x) +``` + +Write a subsetting command to return the values in x that are greater than 4 and less than 7. + +::::::::::::::: solution + +## Solution to Challenge 2 + +```{r} +x_subset <- x[x<7 & x>4] +print(x_subset) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: Non-unique names + +You should be aware that it is possible for multiple elements in a +vector to have the same name. (For a data frame, columns can have +the same name, although R tries to avoid this, but row names +must be unique.) Consider these examples: + +```{r} +x <- 1:3 +x +names(x) <- c('a', 'a', 'a') +x +x['a'] # only returns first value +x[names(x) == 'a'] # returns all three values +``` + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: Getting help for operators + +Remember you can search for help on operators by wrapping them in quotes: `help("%in%")` or `?"%in%"`. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Skipping named elements + +Skipping or removing named elements is a little harder. If we try to skip one named element by negating the string, R complains (slightly obscurely) that it doesn't know how to take the negative of a string: + +```{r, error=TRUE} +x <- c(a=5.4, b=6.2, c=7.1, d=4.8, e=7.5) # we start again by naming a vector 'on the fly' +x[-"a"] +``` + +However, we can use the `!=` (not-equals) operator to construct a logical vector that will do what we want: + +```{r} +x[names(x) != "a"] +``` + +Skipping multiple named indices is a little bit harder still. 
Suppose we want to drop the `"a"` and `"c"` elements, so we try this: + +```{r} +x[names(x)!=c("a","c")] +``` + +R did _something_, but it gave us a warning that we ought to pay attention to, and it apparently _gave us the wrong answer_ (the `"c"` element is still included in the vector)! + +So what does `!=` actually do in this case? That's an excellent question. + +### Recycling + +Let's take a look at the comparison component of this code: + +```{r} +names(x) != c("a", "c") +``` + +Why does R give `TRUE` as the third element of this vector, when `names(x)[3] != "c"` is obviously false? +When you use `!=`, R tries to compare each element of the left argument with the corresponding element of its right argument. What happens when you compare vectors of different lengths? + +![](fig/06-rmd-inequality.1.png){alt='Inequality testing'} + +When one vector is shorter than the other, it gets _recycled_: + +![](fig/06-rmd-inequality.2.png){alt='Inequality testing: results of recycling'} + +In this case R **repeats** `c("a", "c")` as many times as necessary to match `names(x)`, i.e. we get `c("a","c","a","c","a")`. Since the recycled `"a"` +doesn't match the third element of `names(x)`, the value of `!=` is `TRUE`. +Because in this case the longer vector length (5) isn't a multiple of the shorter vector length (2), R printed a warning message. If we had been unlucky and `names(x)` had contained six elements, R would _silently_ have done the wrong thing (i.e., not what we intended it to do). This recycling rule can introduce hard-to-find and subtle bugs! + +The way to get R to do what we really want (match _each_ element of the left argument with _all_ of the elements of the right argument) is to use the `%in%` operator. The `%in%` operator goes through each element of its left argument, in this case the names of `x`, and asks, "Does this element occur in the second argument?". Here, since we want to _exclude_ values, we also need a `!` operator to change "in" to "not in": + +```{r} +x[! names(x) %in% c("a","c") ] +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 3 + +Selecting elements of a vector that match any of a list of components +is a very common data analysis task. 
For example, the gapminder data set +contains `country` and `continent` variables, but no information between +these two scales. Suppose we want to pull out information from southeast +Asia: how do we set up an operation to produce a logical vector that +is `TRUE` for all of the countries in southeast Asia and `FALSE` otherwise? + +Suppose you have these data: + +```{r} +seAsia <- c("Myanmar","Thailand","Cambodia","Vietnam","Laos") +## read in the gapminder data that we downloaded in episode 2 +gapminder <- read.csv("data/gapminder_data.csv", header=TRUE) +## extract the `country` column from a data frame (we'll see this later); +## convert from a factor to a character; +## and get just the non-repeated elements +countries <- unique(as.character(gapminder$country)) +``` + +There's a wrong way (using only `==`), which will give you a warning; +a clunky way (using the logical operators `==` and `|`); and +an elegant way (using `%in%`). See whether you can come up with all three +and explain how they (don't) work. + +::::::::::::::: solution + +## Solution to Challenge 3 + +- The **wrong** way to do this problem is `countries==seAsia`. This + gives a warning (`"In countries == seAsia : longer object length is not a multiple of shorter object length"`) and the wrong answer (a vector of all + `FALSE` values), because none of the recycled values of `seAsia` happen + to line up correctly with matching values in `countries`. +- The **clunky** (but technically correct) way to do this problem is + +```{r, results="hide"} + (countries=="Myanmar" | countries=="Thailand" | + countries=="Cambodia" | countries == "Vietnam" | countries=="Laos") +``` + +(or `countries==seAsia[1] | countries==seAsia[2] | ...`). This +gives the correct values, but hopefully you can see how awkward it +is (what if we wanted to select countries from a much longer list?). + +- The best way to do this problem is `countries %in% seAsia`, which + is both correct and easy to type (and read). 
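
To see the difference between the wrong and the elegant approach without loading the full data set, here is a minimal sketch. The six-element `countries` vector below is a made-up stand-in for the real one, chosen only for illustration.

```r
# Hypothetical short list of countries, used only for illustration
countries <- c("Laos", "France", "Thailand", "Peru", "Vietnam", "Chad")
seAsia <- c("Myanmar", "Thailand", "Cambodia", "Vietnam", "Laos")

# == compares position by position and recycles seAsia;
# R warns because 6 is not a multiple of 5
wrong <- countries == seAsia

# %in% checks each country against the whole of seAsia
right <- countries %in% seAsia

countries[wrong]  # nothing lines up, so no country is selected
countries[right]  # the three southeast Asian countries
```

With this stand-in data, `==` selects nothing even though three of the countries are in `seAsia`, because a match only counts when it happens to sit at the same position after recycling.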
+ +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Handling special values + +At some point you will encounter functions in R that cannot handle missing, infinite, or undefined data. + +There are a number of special functions you can use to filter out this data: + +- `is.na` will return all positions in a vector, matrix, or data.frame + containing `NA` (or `NaN`) +- likewise, `is.nan`, and `is.infinite` will do the same for `NaN` and `Inf`. +- `is.finite` will return all positions in a vector, matrix, or data.frame + that do not contain `NA`, `NaN` or `Inf`. +- `na.omit` will filter out all missing values from a vector + +## Factor subsetting + +Now that we've explored the different ways to subset vectors, how do we subset the other data structures? + +Factor subsetting works the same way as vector subsetting. + +```{r} +f <- factor(c("a", "a", "b", "c", "c", "d")) +f[f == "a"] +f[f %in% c("b", "c")] +f[1:3] +``` + +Skipping elements will not remove the level even if no more of that category exists in the factor: + +```{r} +f[-3] +``` + +## Matrix subsetting + +Matrices are also subsetted using the `[` function. In this case +it takes two arguments: the first applying to the rows, the second +to its columns: + +```{r} +set.seed(1) +m <- matrix(rnorm(6*4), ncol=4, nrow=6) +m[3:4, c(3,1)] +``` + +You can leave the first or second argument blank to retrieve all the rows or columns respectively: + +```{r} +m[, c(3,4)] +``` + +If we only access one row or column, R will automatically convert the result to a vector: + +```{r} +m[3,] +``` + +If you want to keep the output as a matrix, you need to specify a _third_ argument, `drop = FALSE`: + +```{r} +m[3, , drop=FALSE] +``` + +Unlike vectors, if we try to access a row or column outside of the matrix, R will throw an error: + +```{r, error=TRUE} +m[, c(3,6)] +``` + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: Higher dimensional arrays + +When dealing with multi-dimensional arrays, each argument to `[` corresponds to a dimension. For example, for a 3D array, the first three arguments correspond to the rows, columns, and depth dimension. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +Because matrices are vectors, we can also subset using only one argument: + +```{r} +m[5] +``` + +This usually isn't useful, and often confusing to read. However it is useful to note that matrices +are laid out in _column-major format_ by default. 
That is, the elements of the +vector are arranged column-wise: + +```{r} +matrix(1:6, nrow=2, ncol=3) +``` + +If you wish to populate the matrix by row, use `byrow=TRUE`: + +```{r} +matrix(1:6, nrow=2, ncol=3, byrow=TRUE) +``` + +Matrices can also be subsetted using their row names and column names instead of their row and column indices. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 4 + +Given the following code: + +```{r} +m <- matrix(1:18, nrow=3, ncol=6) +print(m) +``` + +1. Which of the following commands will extract the values 11 and 14? + +A. `m[2,4,2,5]` + +B. `m[2:5]` + +C. `m[4:5,2]` + +D. `m[2,c(4,5)]` + +::::::::::::::: solution + +## Solution to Challenge 4 + +D + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## List subsetting + +Now we'll look more closely at the subsetting operators for lists. There are three functions +used to subset lists, all of which we have already met earlier in the lesson: `[`, `[[`, and `$`. + +Using `[` will always return a list. If you want to _subset_ a list, but not +_extract_ an element, then you will likely use `[`. + +```{r} +xlist <- list(a = "Software Carpentry", b = 1:10, data = head(mtcars)) +xlist[1] +``` + +This returns a _list with one element_. + +We can subset elements of a list exactly the same way as atomic vectors using `[`. Comparison operations, however, won't work, as they're not recursive: they will try to condition on the data structures in each element of the list, not the individual elements within those data structures. + +```{r} +xlist[1:2] +``` + +To extract individual elements of a list, you need to use the double-square bracket function, `[[`: + +```{r} +xlist[[1]] +``` + +Notice that now the result is a vector, not a list. + +You can't extract more than one element at once: + +```{r, error=TRUE} +xlist[[1:2]] +``` + +Nor use it to skip elements: + +```{r, error=TRUE} +xlist[[-1]] +``` + +But you can use names to both subset and extract elements: + +```{r} +xlist[["a"]] +``` + +The `$` function is a shorthand way of extracting elements by name: + +```{r} +xlist$data +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 5 + +Given the following list: + +```{r, eval=FALSE} +xlist <- list(a = "Software Carpentry", b = 1:10, data = head(mtcars)) +``` + +Using your knowledge of both list and vector subsetting, extract the number 2 from xlist. 
+Hint: the number 2 is contained within the "b" item in the list. + +::::::::::::::: solution + +## Solution to Challenge 5 + +```{r} +xlist$b[2] +``` + +```{r} +xlist[[2]][2] +``` + +```{r} +xlist[["b"]][2] +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 6 + +Given a linear model: + +```{r, eval=FALSE} +mod <- aov(pop ~ lifeExp, data=gapminder) +``` + +Extract the residual degrees of freedom (hint: `attributes()` will help you) + +::::::::::::::: solution + +## Solution to Challenge 6 + +```{r, eval=FALSE} +attributes(mod) ## `df.residual` is one of the names of `mod` +``` + +```{r, eval=FALSE} +mod$df.residual +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Data frames + +Remember that data frames are lists underneath the hood, so similar rules apply. However, they are also two-dimensional objects. + +`[` with one argument will act the same way as for lists, where each list element corresponds to a column. The resulting object will be a data frame: + +```{r} +head(gapminder[3]) +``` + +Similarly, `[[` will act to extract _a single column_: + +```{r} +head(gapminder[["lifeExp"]]) +``` + +And `$` provides a convenient shorthand to extract columns by name: + +```{r} +head(gapminder$year) +``` + +With two arguments, `[` behaves the same way as for matrices: + +```{r} +gapminder[1:3,] +``` + +If we subset a single row, the result will be a data frame (because the elements are mixed types): + +```{r} +gapminder[3,] +``` + +But for a single column the result will be a vector (this can be changed with the third argument, `drop = FALSE`). + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 7 + +Fix each of the following common data frame subsetting errors: + +1. Extract observations collected for the year 1957 + +```{r, eval=FALSE} +gapminder[gapminder$year = 1957,] +``` + +2. Extract all columns except 1 through to 4 + +```{r, eval=FALSE} +gapminder[,-1:4] +``` + +3. Extract the rows where the life expectancy is longer than 80 years + +```{r, eval=FALSE} +gapminder[gapminder$lifeExp > 80] +``` + +4. Extract the first row, and the fourth and fifth columns + (`continent` and `lifeExp`). + +```{r, eval=FALSE} +gapminder[1, 4, 5] +``` + +5. 
Advanced: extract rows that contain information for the years 2002 + and 2007 + +```{r, eval=FALSE} +gapminder[gapminder$year == 2002 | 2007,] +``` + +::::::::::::::: solution + +## Solution to Challenge 7 + +Fix each of the following common data frame subsetting errors: + +1. Extract observations collected for the year 1957 + +```{r, eval=FALSE} +# gapminder[gapminder$year = 1957,] +gapminder[gapminder$year == 1957,] +``` + +2. Extract all columns except 1 through to 4 + +```{r, eval=FALSE} +# gapminder[,-1:4] +gapminder[,-c(1:4)] +``` + +3. Extract the rows where the life expectancy is longer than 80 years + +```{r, eval=FALSE} +# gapminder[gapminder$lifeExp > 80] +gapminder[gapminder$lifeExp > 80,] +``` + +4. Extract the first row, and the fourth and fifth columns + (`continent` and `lifeExp`). + +```{r, eval=FALSE} +# gapminder[1, 4, 5] +gapminder[1, c(4, 5)] +``` + +5. Advanced: extract rows that contain information for the years 2002 + and 2007 + +```{r, eval=FALSE} +# gapminder[gapminder$year == 2002 | 2007,] +gapminder[gapminder$year == 2002 | gapminder$year == 2007,] +gapminder[gapminder$year %in% c(2002, 2007),] +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 8 + +1. Why does `gapminder[1:20]` return an error? How does it differ from `gapminder[1:20, ]`? + +2. Create a new `data.frame` called `gapminder_small` that only contains rows 1 through 9 and 19 through 23. You can do this in one or two steps. + +::::::::::::::: solution + +## Solution to Challenge 8 + +1. `gapminder` is a data frame, so it needs to be subsetted on two dimensions. With a single index, `[` selects columns, and `gapminder` has only six columns, so `gapminder[1:20]` fails. `gapminder[1:20, ]` subsets the data to give the first 20 rows and all columns. + +2. + +```{r} +gapminder_small <- gapminder[c(1:9, 19:23),] +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: keypoints + +- Indexing in R starts at 1, not 0. +- Access individual values by location using `[]`. +- Access slices of data using `[low:high]`. +- Access arbitrary sets of data using `[c(...)]`. 
+- Use logical operations and logical vectors to access subsets of data. + +:::::::::::::::::::::::::::::::::::::::::::::::::: diff --git a/locale/ja/episodes/07-control-flow.Rmd b/locale/ja/episodes/07-control-flow.Rmd new file mode 100644 index 000000000..070d0b639 --- /dev/null +++ b/locale/ja/episodes/07-control-flow.Rmd @@ -0,0 +1,530 @@ +--- +title: Control Flow +teaching: 45 +exercises: 20 +source: Rmd +--- + +::::::::::::::::::::::::::::::::::::::: objectives + +- Write conditional statements with `if...else` statements and `ifelse()`. +- Write and understand `for()` loops. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- How can I make data-dependent choices in R? +- How can I repeat operations in R? + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +```{r, include=FALSE} +gapminder <- read.csv("data/gapminder_data.csv", header = TRUE) +set.seed(10) +``` + +Often when we're coding we want to control the flow of our actions. This can be done by setting actions to occur only if a condition or a set of conditions is met. +Alternatively, we can set an action to occur a particular number of times. + +There are several ways you can control flow in R. +For conditional statements, the most commonly used approaches are the constructs: + +```{r, eval=FALSE} +# if +if (condition is true) { + perform action +} + +# if ... else +if (condition is true) { + perform action +} else { # that is, if the condition is false, + perform alternative action +} +``` + +Say, for example, that we want R to print a message if a variable `x` has a particular value: + +```{r} +x <- 8 + +if (x >= 10) { + print("x is greater than or equal to 10") +} + +x +``` + +The print statement does not appear in the console because x is not greater than or equal to 10. To print a different message for numbers less than 10, we can add an `else` statement. 
+ +```{r} +x <- 8 + +if (x >= 10) { + print("x is greater than or equal to 10") +} else { + print("x is less than 10") +} +``` + +You can also test multiple conditions by using `else if`. + +```{r} +x <- 8 + +if (x >= 10) { + print("x is greater than or equal to 10") +} else if (x > 5) { + print("x is greater than 5, but less than 10") +} else { + print("x is less than or equal to 5") +} +``` + +**Important:** when R evaluates the condition inside `if()` statements, it is +looking for a logical element, i.e., `TRUE` or `FALSE`. This can cause some +headaches for beginners. For example: + +```{r} +x <- 4 == 3 +if (x) { + "4 equals 3" +} else { + "4 does not equal 3" +} +``` + +As we can see, the "does not equal" message was printed because the vector x is `FALSE`. + +```{r} +x <- 4 == 3 +x +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 1 + +Use an `if()` statement to print a suitable message +reporting whether there are any records from 2002 in +the `gapminder` dataset. +Now do the same for 2012. + +::::::::::::::: solution + +## Solution to Challenge 1 + +We will first see a solution to Challenge 1 which does not use the `any()` function. +We first use a logical vector describing which elements of `gapminder$year` are equal to `2002` to select the matching rows: + +```{r ch10pt1-sol, eval=FALSE} +gapminder[(gapminder$year == 2002),] +``` + +Then, we count the number of rows of the data.frame `gapminder` that correspond to the year 2002: + +```{r ch10pt2-sol, eval=FALSE} +rows2002_number <- nrow(gapminder[(gapminder$year == 2002),]) +``` + +The presence of any record for the year 2002 is equivalent to the condition that `rows2002_number` is one or more: + +```{r ch10pt3-sol, eval=FALSE} +rows2002_number >= 1 +``` + +Putting it all together, we obtain: + +```{r ch10pt4-sol, eval=FALSE} +if(nrow(gapminder[(gapminder$year == 2002),]) >= 1){ + print("Record(s) for the year 2002 found.") +} +``` + +All this can be done more quickly with `any()`. 
The logical condition can be expressed as: + +```{r ch10pt5-sol, eval=FALSE} +if(any(gapminder$year == 2002)){ + print("Record(s) for the year 2002 found.") +} +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +Did anyone get an error or warning message like this? + +```{r, echo=FALSE, error=TRUE} +if (gapminder$year == 2012) {} +``` + +The `if()` function only accepts singular (of length 1) inputs. Since R 4.2.0, it +returns an error when you use it with a longer vector. In earlier versions, `if()` would still +run with a warning, but would only evaluate the condition in the first element of the vector. +Therefore, to use the `if()` function, you need to make sure your input is +singular (of length 1). + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: Built in `ifelse()` function + +`R` accepts both `if()` and `else if()` statements structured as outlined above, +but also statements using `R`'s built-in `ifelse()` function. This +function accepts both singular and vector inputs and is structured as +follows: + +```{r, eval=FALSE} +# ifelse function +ifelse(condition is true, perform action, perform alternative action) + +``` + +where the first argument is the condition or a set of conditions to be met, the +second argument is the statement that is evaluated when the condition is `TRUE`, +and the third argument is the statement that is evaluated when the condition +is `FALSE`. 
+ +```{r} +y <- -3 +ifelse(y < 0, "y is a negative number", "y is either positive or zero") + +``` + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::::: callout + +## ヒント:`any()` と `all()` + +`any()` 関数は、ベクトルの中に少なくとも1つ `TRUE` の値がある場合、 `TRUE` を返し、 そうでない場合は、 `FALSE` を返します。 +これは、 `%in%` 演算子でも同様に使えます。 +関数 `all()` は、その名前が示唆しているように、ベクトル内の全ての値が `TRUE` である時のみ、 `TRUE` となります。 + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## 繰り返し行う処理 + +If you want to iterate over +a set of values, when the order of iteration is important, and perform the +same operation on each, a `for()` loop will do the job. +We saw `for()` loops in the [shell lessons earlier](https://swcarpentry.github.io/shell-novice/05-loop.html). This is the most +flexible of looping operations, but therefore also the hardest to use +correctly. In general, the advice of many `R` users would be to learn about +`for()` loops, but to avoid using `for()` loops unless the order of iteration is +important: i.e. the calculation at each iteration depends on the results of +previous iterations. If the order of iteration is not important, then you +should learn about vectorized alternatives, such as the `purrr` package, as they +pay off in computational efficiency. + +`for()` ループの基本構造は: + +```{r, eval=FALSE} +for (iterator in set of values) { + do a thing +} +``` + +例えば: + +```{r} +for (i in 1:10) { + print(i) +} +``` + +`1:10` の部分は、ベクトルをその場で作るものです。 他のベクトルの中身を繰り返すこともできます。 + +`for()` ループを、もうひとつの `for()` ループと入れ子となる形にすれば、 2つ同時に繰り返すこともできます。 + +```{r} +for (i in 1:5) { + for (j in c('a', 'b', 'c', 'd', 'e')) { + print(paste(i,j)) + } +} +``` + +We notice in the output that when the first index (`i`) is set to 1, the second +index (`j`) iterates through its full set of indices. Once the indices of `j` +have been iterated through, then `i` is incremented. This process continues +until the last index has been used for each `for()` loop. 
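
The iteration order described above can be checked against a vectorized construction. The following is a minimal sketch, not part of the lesson's own material: the names `i_values`, `j_values`, `loop_result`, and `vectorized_result` are hypothetical, and the approach assumes only base R's `rep()` and `paste()`.

```r
# Same values as the nested loop in the text (names are illustrative)
i_values <- 1:5
j_values <- c("a", "b", "c", "d", "e")

# Nested loop: i is the outer (slow) index, j the inner (fast) index.
# The result vector is preallocated and filled by position.
loop_result <- character(length(i_values) * length(j_values))
k <- 1
for (i in i_values) {
  for (j in j_values) {
    loop_result[k] <- paste(i, j)
    k <- k + 1
  }
}

# Vectorized equivalent: each i is repeated for a full pass over j,
# and the whole of j is repeated once per value of i.
vectorized_result <- paste(
  rep(i_values, each = length(j_values)),
  rep(j_values, times = length(i_values))
)

# Both approaches should produce the same elements in the same order
print(identical(loop_result, vectorized_result))
```

Because `i` only advances after `j` has run through all of its values, `rep(..., each = )` plays the role of the outer loop and `rep(..., times = )` the role of the inner one.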
+ +We can also write the results of our loop to a new object instead of printing them. + +```{r} +output_vector <- c() +for (i in 1:5) { + for (j in c('a', 'b', 'c', 'd', 'e')) { + temp_output <- paste(i, j) + output_vector <- c(output_vector, temp_output) + } +} +output_vector +``` + +This approach can be useful, but 'growing your results' (building the result object incrementally) is computationally inefficient, so avoid it when you are iterating through a lot of values. + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: don't grow your results + +One of the biggest things that trips up novices and +experienced R users alike is building a results object +(vector, list, matrix, data frame) as your for loop progresses. +Computers are very bad at handling this, so your calculations +can very quickly slow to a crawl. It's much better to define +an empty results object of the appropriate dimensions beforehand, rather +than initializing an empty object without dimensions. +So if you know the end result will be stored in a matrix, +create an empty matrix with 5 rows and 5 columns, then at each iteration +store the results in the appropriate location. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +A better way is to define your (empty) output object before filling in the values. +For this example it looks more involved, but it is more efficient. + +```{r} +output_matrix <- matrix(nrow = 5, ncol = 5) +j_vector <- c('a', 'b', 'c', 'd', 'e') +for (i in 1:5) { + for (j in 1:5) { + temp_j_value <- j_vector[j] + temp_output <- paste(i, temp_j_value) + output_matrix[i, j] <- temp_output + } +} +output_vector2 <- as.vector(output_matrix) +output_vector2 +``` + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: While loops + +Sometimes you will find yourself needing to repeat an operation as long as a certain condition is met. You can do this with a `while()` loop. + +```{r, eval=FALSE} +while(this condition is true){ + do a thing +} +``` + +R will interpret a condition being met as "TRUE". + +As an example, here's a while loop that generates random numbers between 0 and 1 from a uniform distribution (the `runif()` function) until it gets one that's less than 0.1. + +```{r} +z <- 1 +while(z > 0.1){ + z <- runif(1) + cat(z, "\n") +} +``` + +`while()` loops will not always be appropriate. 
You have to be particularly careful +that you don't end up stuck in an infinite loop because your condition is always met and hence the while statement never terminates. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 2 + +Compare the objects `output_vector` and +`output_vector2`. Are they the same? If not, why not? +How would you change the last block of code to make `output_vector2` +the same as `output_vector`? + +::::::::::::::: solution + +## Solution to Challenge 2 + +We can check whether the two vectors are identical using the `all()` function: + +```{r ch10pt6-sol, eval=FALSE} +all(output_vector == output_vector2) +``` + +However, all the elements of `output_vector` can be found in `output_vector2`: + +```{r ch10pt7-sol, eval=FALSE} +all(output_vector %in% output_vector2) +``` + +and vice versa: + +```{r ch10pt8-sol, eval=FALSE} +all(output_vector2 %in% output_vector) +``` + +Therefore, the elements of `output_vector` and `output_vector2` are the same but sorted in a different order. +This is because `as.vector()` reads the elements of its input matrix column by column. +Looking at `output_matrix`, we can see that we want its elements row by row. +The solution is to transpose `output_matrix`. We can do so either by calling the transpose function +`t()` or by inputting the elements in the right order. 
+The first solution requires changing the original + +```{r ch10pt9-sol, eval=FALSE} +output_vector2 <- as.vector(output_matrix) +``` + +into + +```{r ch10pt10-sol, eval=FALSE} +output_vector2 <- as.vector(t(output_matrix)) +``` + +The second solution requires changing + +```{r ch10pt11-sol, eval=FALSE} +output_matrix[i, j] <- temp_output +``` + +into + +```{r ch10pt12-sol, eval=FALSE} +output_matrix[j, i] <- temp_output +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 3 + +Write a script that loops through the `gapminder` data by continent and prints out whether the mean life expectancy is smaller or larger than 50 years. + +::::::::::::::: solution + +## Solution to Challenge 3 + +**Step 1**: We want to make sure we can extract all the unique values of the continent vector + +```{r 07-chall-03-sol-a, eval=FALSE} +gapminder <- read.csv("data/gapminder_data.csv") +unique(gapminder$continent) +``` + +**Step 2**: We also need to loop over each of these continents and calculate the average life expectancy for each `subset` of data. +We can do that as follows: + +1. Loop over each of the unique values of 'continent' +2. For each value of continent, create a temporary variable storing that subset +3. Return the calculated life expectancy to the user by printing the output: + +```{r 07-chall-03-sol-b, eval=FALSE} +for (iContinent in unique(gapminder$continent)) { + tmp <- gapminder[gapminder$continent == iContinent, ] + cat(iContinent, mean(tmp$lifeExp, na.rm = TRUE), "\n") + rm(tmp) +} +``` + +**Step 3**: The exercise asks us to report whether the average life expectancy is below or above 50 years. +So we need to add an `if()` condition before printing, which evaluates whether the calculated average life expectancy is above or below a threshold, and prints an output conditional on the result. +We need to amend (3) from above: + +3a. 
If the calculated life expectancy is less than some threshold (50 years), return the continent and a statement that life expectancy is less than threshold, otherwise return the continent and a statement that life expectancy is greater than threshold: + +```{r 07-chall-03-sol-c, eval=FALSE} +thresholdValue <- 50 + +for (iContinent in unique(gapminder$continent)) { + tmp <- mean(gapminder[gapminder$continent == iContinent, "lifeExp"]) + + if (tmp < thresholdValue){ + cat("Average Life Expectancy in", iContinent, "is less than", thresholdValue, "\n") + } else { + cat("Average Life Expectancy in", iContinent, "is greater than", thresholdValue, "\n") + } # end if else condition + rm(tmp) +} # end for loop + +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 4 + +Modify the script from Challenge 3 to loop over each country. This time, print out whether the life expectancy is smaller than 50, between 50 and 70, or greater than 70. + +::::::::::::::: solution + +## Solution to Challenge 4 + +We modify our solution to Challenge 3 by now adding two thresholds, `lowerThreshold` and `upperThreshold` and extending our if-else statements: + +```{r 07-chall-04-sol, eval=FALSE} +lowerThreshold <- 50 +upperThreshold <- 70 + +for (iCountry in unique(gapminder$country)) { + tmp <- mean(subset(gapminder, country == iCountry)$lifeExp) + + if (tmp < lowerThreshold) { + cat("Average Life Expectancy in", iCountry, "is less than", lowerThreshold, "\n") + } else if (tmp >= lowerThreshold && tmp < upperThreshold) { # >= so a mean of exactly 50 lands here + cat("Average Life Expectancy in", iCountry, "is between", lowerThreshold, "and", upperThreshold, "\n") + } else { + cat("Average Life Expectancy in", iCountry, "is greater than", upperThreshold, "\n") + } + rm(tmp) +} +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 5 - Advanced + +Write a script 
that loops over each country in the `gapminder` dataset, +tests whether the country starts with a 'B', and graphs life expectancy +against time as a line graph if the mean life expectancy is under 50 years. + +::::::::::::::: solution + +## Solution to Challenge 5 + +We will use the `grep()` command that was introduced in the [Unix Shell lesson](https://swcarpentry.github.io/shell-novice/07-find.html) +to find countries that start with "B." +Let's understand how to do this first. +Following from the Unix shell section we may be tempted to try the following: + +```{r 07-chall-05-sol-a, eval=FALSE} +grep("^B", unique(gapminder$country)) +``` + +But when we evaluate this command it returns the indices of the elements of `unique(gapminder$country)` that start with "B." +To get the values, we must add the `value=TRUE` option to the `grep()` command: + +```{r 07-chall-05-sol-b, eval=FALSE} +grep("^B", unique(gapminder$country), value = TRUE) +``` + +We will now store these countries in a variable called candidateCountries, and then loop over each entry in the variable. +Inside the loop, we evaluate the average life expectancy for each country, and if the average life expectancy is less than 50 we use base-plot to plot the evolution of average life expectancy using `with()` and `subset()`: + +```{r 07-chall-05-sol-c, eval=FALSE} +thresholdValue <- 50 +candidateCountries <- grep("^B", unique(gapminder$country), value = TRUE) + +for (iCountry in candidateCountries) { + tmp <- mean(gapminder[gapminder$country == iCountry, "lifeExp"]) + + if (tmp < thresholdValue) { + cat("Average Life Expectancy in", iCountry, "is less than", thresholdValue, "plotting life expectancy graph... 
\n") + + with(subset(gapminder, country == iCountry), + plot(year, lifeExp, + type = "o", + main = paste("Life Expectancy in", iCountry, "over time"), + ylab = "Life Expectancy", + xlab = "Year" + ) # end plot + ) # end with + } # end if + rm(tmp) +} # end for loop +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: keypoints + +- Use `if` and `else` to make choices. +- Use `for` to repeat operations. + +:::::::::::::::::::::::::::::::::::::::::::::::::: diff --git a/locale/ja/episodes/08-plot-ggplot2.Rmd b/locale/ja/episodes/08-plot-ggplot2.Rmd new file mode 100644 index 000000000..b45da38f6 --- /dev/null +++ b/locale/ja/episodes/08-plot-ggplot2.Rmd @@ -0,0 +1,437 @@ +--- +title: Creating Publication-Quality Graphics with ggplot2 +teaching: 60 +exercises: 20 +source: Rmd +--- + +::::::::::::::::::::::::::::::::::::::: objectives + +- To be able to use ggplot2 to generate publication-quality graphics. +- To apply geometry, aesthetic, and statistics layers to a ggplot plot. +- To manipulate the aesthetics of a plot using different colors, shapes, and lines. +- To improve data visualization through transforming scales and paneling by group. +- To save a plot created with ggplot to disk. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- How can I create publication-quality graphics in R? 
+ +:::::::::::::::::::::::::::::::::::::::::::::::::: + +```{r, include=FALSE} +gapminder <- read.csv("data/gapminder_data.csv", header = TRUE) +``` + +データをプロットすることは、データとその変数間の様々な関係をクイックに探索する最良の方法の一つです。 + +Rには、主に3つのプロットシステムがあります。 [R組み込みplot関数][base]、 [lattice][lattice] パッケージ、[ggplot2][ggplot2] パッケージです。 + +今回、私たちはggplot2パッケージについて学んでいきます。 なぜなら、ggplot2パッケージは出版品質並のグラフィック作成に最も効果的だからです。 + +ggplot2 is built on the grammar of graphics, the idea that any plot can be +built from the same set of components: a **data set**, +**mapping aesthetics**, and graphical **layers**: + +- **Data sets** are the data that you, the user, provide. + +- **Mapping aesthetics** are what connect the data to the graphics. + They tell ggplot2 how to use your data to affect how the graph looks, + such as changing what is plotted on the X or Y axis, or the size or + color of different data points. + +- **Layers** are the actual graphical output from ggplot2. Layers + determine what kinds of plot are shown (scatterplot, histogram, etc.), + the coordinate system used (rectangular, polar, others), and other + important aspects of the plot. このアイディアは、Photoshop、Illustrator、Inkscapeなどの画像編集ソフトを使用する場面でお馴染みかもしれません。 + +Let's start off building an example using the gapminder data from earlier. +The most basic function is `ggplot`, which lets R know that we're +creating a new plot. Any of the arguments we give the `ggplot` +function are the _global_ options for the plot: they apply to all +layers on the plot. + +```{r blank-ggplot, message=FALSE, fig.alt="Blank plot, before adding any mapping aesthetics to ggplot()."} +library("ggplot2") +ggplot(data = gapminder) +``` + +Here we called `ggplot` and told it what data we want to show on +our figure. This is not enough information for `ggplot` to actually +draw anything. It only creates a blank slate for other elements +to be added to. + +Now we're going to add in the **mapping aesthetics** using the +`aes` function. 
`aes` tells `ggplot` how variables in the **data** +map to _aesthetic_ properties of the figure, such as which columns +of the data should be used for the **x** and **y** locations. + +```{r ggplot-with-aes, message=FALSE, fig.alt="Plotting area with axes for a scatter plot of life expectancy vs GDP, with no data points visible."} +ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +``` + +Here we told `ggplot` we want to plot the "gdpPercap" column of the +gapminder data frame on the x-axis, and the "lifeExp" column on the +y-axis. Notice that we didn't need to explicitly pass `aes` these +columns (e.g. `x = gapminder[, "gdpPercap"]`), this is because +`ggplot` is smart enough to know to look in the **data** for that column! + +The final part of making our plot is to tell `ggplot` how we want to +visually represent the data. We do this by adding a new **layer** +to the plot using one of the **geom** functions. + +```{r lifeExp-vs-gdpPercap-scatter, message=FALSE, fig.alt="Scatter plot of life expectancy vs GDP per capita, now showing the data points."} +ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) + + geom_point() +``` + +Here we used `geom_point`, which tells `ggplot` we want to visually +represent the relationship between **x** and **y** as a scatterplot of points. + +::::::::::::::::::::::::::::::::::::::: challenge + +## チャレンジ1 + +Modify the example so that the figure shows how life expectancy has +changed over time: + +```{r, eval=FALSE} +ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) + geom_point() +``` + +Hint: the gapminder dataset has a column called "year", which should appear +on the x-axis. 
+ +::::::::::::::: solution + +## Solution to Challenge 1 + +Here is one possible solution: + +```{r ch1-sol, fig.cap="Binned scatterplot of life expectancy versus year showing how life expectancy has increased over time"} +ggplot(data = gapminder, mapping = aes(x = year, y = lifeExp)) + geom_point() +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 2 + +In the previous examples and challenge we've used the `aes` function to tell the scatterplot geom about the x and y locations of each point. +Another aesthetic property we can modify is the point color. Modify the code from the previous challenge to color the points by the "continent" column. What trends do you see in the data? Are they what you expected? + +::::::::::::::: solution + +## Solution to Challenge 2 + +The solution presented below adds `color=continent` to the call of the `aes` +function. The general trend seems to indicate an increased life expectancy +over the years. On continents with stronger economies we find a longer life +expectancy. + +```{r ch2-sol, fig.cap="Binned scatterplot of life expectancy vs year with color-coded continents showing value of 'aes' function"} +ggplot(data = gapminder, mapping = aes(x = year, y = lifeExp, color=continent)) + + geom_point() +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Layers + +Using a scatterplot probably isn't the best way to visualize change over time. +Instead, let's tell `ggplot` to visualize the data as a line plot: + +```{r lifeExp-line} +ggplot(data = gapminder, mapping = aes(x=year, y=lifeExp, color=continent)) + + geom_line() +``` + +Instead of adding a `geom_point` layer, we've added a `geom_line` layer. + +However, the result doesn't look quite as we might have expected: it seems to be jumping around a lot in each continent. Let's try to separate the data by country, plotting one line for each country: + +```{r lifeExp-line-by} +ggplot(data = gapminder, mapping = aes(x=year, y=lifeExp, group=country, color=continent)) + + geom_line() +``` + +We've added the **group** aesthetic, which tells `ggplot` to draw a line for each country. + +But what if we want to visualize both lines and points on the plot? 
We can simply add another layer to the plot: + +```{r lifeExp-line-point} +ggplot(data = gapminder, mapping = aes(x=year, y=lifeExp, group=country, color=continent)) + + geom_line() + geom_point() +``` + +It's important to note that each layer is drawn on top of the previous layer. In +this example, the points have been drawn _on top of_ the lines. Here's a +demonstration: + +```{r lifeExp-layer-example-1} +ggplot(data = gapminder, mapping = aes(x=year, y=lifeExp, group=country)) + + geom_line(mapping = aes(color=continent)) + geom_point() +``` + +In this example, the color aesthetic mapping has been moved from the global plot options in `ggplot` to the `geom_line` layer, so it no longer applies to the points. Now we can clearly see that the points are drawn on top of the lines. + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: Setting an aesthetic to a value instead of a mapping + +So far, we've seen how to use an aesthetic (such as **color**) as a _mapping_ to a variable in the data. For example, when we use `geom_line(mapping = aes(color=continent))`, ggplot will give a different color to each continent. But what if we want to change the color of all lines to blue? You may think that `geom_line(mapping = aes(color="blue"))` should work, but it doesn't. Since we don't want to create a mapping to a specific variable, we can move the color specification outside of the `aes()` function, like this: `geom_line(color="blue")`. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 3 + +Switch the order of the point and line layers from the previous example. What happened? + +::::::::::::::: solution + +## Solution to Challenge 3 + +The lines now get drawn over the points! + +```{r ch3-sol, fig.alt="Line plot of life expectancy vs year with one line per country coloured by continent, drawn on top of the black data points."} +ggplot(data = gapminder, mapping = aes(x=year, y=lifeExp, group=country)) + + geom_point() + geom_line(mapping = aes(color=continent)) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: 
+ +## Transformations and statistics + +ggplot2 also makes it easy to overlay statistical models over the data. To demonstrate, we'll go back to our first example: + +```{r lifeExp-vs-gdpPercap-scatter3, message=FALSE} +ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) + + geom_point() +``` + +Currently it's hard to see the relationship between the points due to some strong +outliers in GDP per capita. We can change the scale of units on the x axis using +the _scale_ functions. These control the mapping between the data values and +visual values of an aesthetic. We can also modify the transparency of the +points, using the `alpha` argument, which is especially helpful when you have +a large amount of data which is very clustered. + +```{r axis-scale, fig.cap="Scatterplot of GDP vs life expectancy showing logarithmic x-axis data spread"} +ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) + + geom_point(alpha = 0.5) + scale_x_log10() +``` + +The `scale_x_log10` function applied a transformation to the x-axis scale of the plot, so that each tenfold increase spans the same horizontal distance. For example, a GDP per capita of 1,000 is the same horizontal distance away from a value of 10,000 as the 10,000 value is from 100,000. This helps to visualize the spread of the data along the x-axis. 
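
The spacing claim for the three GDP values in the text can be checked with plain arithmetic. This is a small sketch using only base R's `log10()` and `diff()`; the variable names are hypothetical, and it illustrates the logarithm itself rather than ggplot2's internal scale code.

```r
# GDP per capita values from the example in the text
gdp_values <- c(1000, 10000, 100000)

# On a log10 axis, a value is positioned at its base-10 logarithm
log_positions <- log10(gdp_values)
print(log_positions)

# Gaps between neighbouring positions on the log scale
log_gaps <- diff(log_positions)
print(log_gaps)

# Gaps between the same values on the original linear scale
linear_gaps <- diff(gdp_values)
print(linear_gaps)
```

On the log scale both gaps have the same size, while on the linear scale the second gap is ten times the first, which is why the untransformed plot squeezes most points into the left edge.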
+ +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip Reminder: Setting an aesthetic to a value instead of a mapping + +Notice that we used `geom_point(alpha = 0.5)`. As the previous tip mentioned, using a setting outside of the `aes()` function will cause this value to be used for all points, which is what we want in this case. But just like any other aesthetic setting, alpha can also be mapped to a variable in the data. For example, we can give a different transparency to each continent with `geom_point(aes(alpha = continent))`. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +We can fit a simple relationship to the data by adding another layer, `geom_smooth`: + +```{r lm-fit, fig.alt="Scatter plot of life expectancy vs GDP per capita with a blue trend line summarising the relationship between variables, and gray shaded area indicating 95% confidence intervals for that trend line."} +ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) + + geom_point(alpha = 0.5) + scale_x_log10() + geom_smooth(method="lm") +``` + +We can make the line thicker by setting the **linewidth** aesthetic in the `geom_smooth` layer: + +```{r lm-fit2, fig.alt="Scatter plot of life expectancy vs GDP per capita with a trend line summarising the relationship between variables. The blue trend line is slightly thicker than in the previous figure."} +ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) + + geom_point(alpha = 0.5) + scale_x_log10() + geom_smooth(method="lm", linewidth=1.5) +``` + +There are two ways an aesthetic can be specified. Here we _set_ the **linewidth** aesthetic by passing it as an argument to `geom_smooth` and it is applied the same to the whole `geom`. Previously in the lesson we've used the `aes` function to define a _mapping_ between data variables and their visual representation. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 4a + +Modify the color and size of the points on the point layer in the previous example. + +Hint: do not use the `aes` function. + +Hint: the equivalent of `linewidth` for points is `size`. + +::::::::::::::: solution + +## Solution to Challenge 4a + +Here is a possible solution: +Notice that the `color` argument is supplied outside of the `aes()` function. +This means that it applies to all data points on the graph and is not related to +a specific variable. 
+ +```{r ch4a-sol, fig.alt="Scatter plot of life expectancy vs GDP per capita with a trend line summarising the relationship between variables. The plot illustrates the possibilities for styling visualisations in ggplot2 with data points enlarged, coloured orange, and displayed without transparency."} +ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) + + geom_point(size=3, color="orange") + scale_x_log10() + + geom_smooth(method="lm", linewidth=1.5) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 4b + +Modify your solution to Challenge 4a so that the points are now a different shape and are colored by continent, with a trend line for each continent. Hint: the color argument can be used inside the `aes` function. + +::::::::::::::: solution + +## Solution to Challenge 4b + +Here is a possible solution: +Notice that supplying the `color` argument inside the `aes()` function enables you to +connect it to a certain variable. The `shape` argument, as you can see, modifies all +data points the same way (it is outside the `aes()` call) while the `color` argument, which +is placed inside the `aes()` call, modifies a point's color based on its continent value. + +```{r ch4b-sol} +ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp, color = continent)) + + geom_point(size=3, shape=17) + scale_x_log10() + + geom_smooth(method="lm", linewidth=1.5) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Multi-panel figures + +Earlier we visualized the change in life expectancy over time across all countries in one plot. Alternatively, we can split this out over multiple panels by adding a layer of facet panels. + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip + +We start by making a subset of data including only countries located +in the Americas. This includes 25 countries, which will begin to +clutter the figure. Note that we apply a "theme" definition to rotate +the x-axis labels to maintain readability. Nearly everything in +ggplot2 is customizable. 
+ +:::::::::::::::::::::::::::::::::::::::::::::::::: + +```{r facet} +americas <- gapminder[gapminder$continent == "Americas",] +ggplot(data = americas, mapping = aes(x = year, y = lifeExp)) + + geom_line() + + facet_wrap( ~ country) + + theme(axis.text.x = element_text(angle = 45)) +``` + +The `facet_wrap` layer took a "formula" as its argument, denoted by the tilde (~). This tells R to draw a panel for each unique value in the country column of the gapminder dataset. + +## Modifying text + +To clean this figure up for a publication we need to change some of the text elements. The x-axis is too cluttered, and the y-axis should read "Life expectancy" rather than the column name in the data frame. + +We can do this by adding a couple of different layers. The **theme** layer controls the axis text and overall text size. Labels for the axes, plot title and any legend can be set using the `labs` function. Legend titles are set using the same names we used in the `aes` specification. Thus the color legend title is set using `color = "Continent"`, while the title of a fill legend would be set using `fill = "MyTitle"`. + +```{r theme} +ggplot(data = americas, mapping = aes(x = year, y = lifeExp, color=continent)) + + geom_line() + facet_wrap( ~ country) + + labs( + x = "Year", # x axis title + y = "Life expectancy", # y axis title + title = "Figure 1", # main title of figure + color = "Continent" # title of legend + ) + + theme(axis.text.x = element_text(angle = 90, hjust = 1)) +``` + +## Exporting the plot + +The `ggsave()` function allows you to export a plot created with ggplot. You can specify the dimensions and resolution of your plot by adjusting the appropriate arguments (`width`, `height` and `dpi`) to create high quality graphics for publication. To save the plot from above, we first assign it to a variable `lifeExp_plot`, then tell `ggsave` to save that plot in `png` format to a directory called `results`. (Make sure you have a `results/` folder in your working directory.) + +```{r directory-check, echo=FALSE} +if (!dir.exists("results")) { + dir.create("results") +} +``` + +```{r save} +lifeExp_plot <- ggplot(data = americas, mapping = aes(x = year, y = lifeExp, color=continent)) + + geom_line() + facet_wrap( ~ country) + + labs( + x = "Year", # x axis title + y = "Life expectancy", # y axis title + title = "Figure 1", # main title of figure + color = "Continent" # title of legend + ) + + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + +ggsave(filename = "results/lifeExp.png", plot = 
lifeExp_plot, width = 12, height = 10, dpi = 300, units = "cm") +``` + +There are two nice things about `ggsave`. First, it defaults to the last plot, so if you omit the `plot` argument it will automatically save the last plot you created with `ggplot`. Second, it tries to determine the format you want to save your plot in from the file extension you provide for the filename (for example `.png` or `.pdf`). If you need to, you can specify the format explicitly in the `device` argument. + +This is a taste of what you can do with ggplot2. RStudio provides a +really useful [cheat sheet][cheat] of the different layers available, and more +extensive documentation is available on the [ggplot2 website][ggplot-doc]. All RStudio cheat sheets are available from the [RStudio website][cheat_all]. +Finally, if you have no idea how to change something, a quick Google search will +usually send you to a relevant question and answer on Stack Overflow with reusable +code to modify! + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 5 + +Generate boxplots to compare life expectancy between the different continents during the available years. + +Advanced: + +- Rename y axis as Life Expectancy. +- Remove x axis labels. + +::::::::::::::: solution + +## Solution to Challenge 5 + +Here is a possible solution: +`xlab()` and `ylab()` set labels for the x and y axes, respectively. +The axis title, text and ticks are attributes of the theme and must be modified within a `theme()` call. 
+ +```{r ch5-sol} +ggplot(data = gapminder, mapping = aes(x = continent, y = lifeExp, fill = continent)) + + geom_boxplot() + facet_wrap(~year) + + ylab("Life Expectancy") + + theme(axis.title.x=element_blank(), + axis.text.x = element_blank(), + axis.ticks.x = element_blank()) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +[base]: https://www.statmethods.net/graphs/index.html +[lattice]: https://www.statmethods.net/advgraphs/trellis.html +[ggplot2]: https://www.statmethods.net/advgraphs/ggplot2.html +[cheat]: https://www.rstudio.org/links/data_visualization_cheat_sheet +[cheat_all]: https://www.rstudio.com/resources/cheatsheets/ +[ggplot-doc]: https://ggplot2.tidyverse.org/reference/ + +:::::::::::::::::::::::::::::::::::::::: keypoints + +- Use `ggplot2` to create plots. +- Think about graphics in layers: aesthetics, geometry, statistics, scale transformation, and grouping. + +:::::::::::::::::::::::::::::::::::::::::::::::::: diff --git a/locale/ja/episodes/09-vectorization.Rmd b/locale/ja/episodes/09-vectorization.Rmd new file mode 100644 index 000000000..7aee19d23 --- /dev/null +++ b/locale/ja/episodes/09-vectorization.Rmd @@ -0,0 +1,314 @@ +--- +title: Vectorization +teaching: 10 +exercises: 15 +source: Rmd +--- + +::::::::::::::::::::::::::::::::::::::: objectives + +- To understand vectorized operations in R. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- How can I operate on all the elements of a vector at once? 
+ +:::::::::::::::::::::::::::::::::::::::::::::::::: + +```{r, include=FALSE} +gapminder <- read.csv("data/gapminder_data.csv", header = TRUE) +library("ggplot2") +``` + +Most of R's functions are vectorized, meaning that the function will operate on all elements of a vector without needing to loop through and act on each element one at a time. This makes writing code more concise, easier to read, and less error prone. + +```{r} +x <- 1:4 +x * 2 +``` + +The multiplication happened to each element of the vector. + +We can also add two vectors together: + +```{r} +y <- 6:9 +x + y +``` + +Each element of `x` was added to its corresponding element of `y`: + +```{r, eval=FALSE} +x: 1 2 3 4 + + + + + +y: 6 7 8 9 +--------------- + 7 9 11 13 +``` + +Here is how we would add two vectors together using a for loop: + +```{r} +output_vector <- c() +for (i in 1:4) { + output_vector[i] <- x[i] + y[i] +} +output_vector + + +``` + +Compare this to the output using vectorised operations. + +```{r} +sum_xy <- x + y +sum_xy +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 1 + +Let's try this on the `pop` column of the `gapminder` dataset. + +Make a new column in the `gapminder` data frame that contains population in units of millions of people. +Check the head or tail of the data frame to make sure it worked. + +::::::::::::::: solution + +## Solution to Challenge 1 + +Let's try this on the `pop` column of the `gapminder` dataset. + +Make a new column in the `gapminder` data frame that contains population in units of millions of people. +Check the head or tail of the data frame to make sure it worked. + +```{r} +gapminder$pop_millions <- gapminder$pop / 1e6 +head(gapminder) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 2 + +On a single graph, plot population, in millions, against year, for all countries. Do not worry about identifying which country is which. + +Repeat the exercise, graphing only China, India, and Indonesia. Again, do not worry about which is which. + +::::::::::::::: solution + +## Solution to Challenge 2 + +Refresh your plotting skills by plotting population in millions against year. + +```{r ch2-sol, fig.alt="Scatter plot showing populations in the millions against the year for China, India, and Indonesia, countries are not labeled."} +ggplot(gapminder, aes(x = year, y = pop_millions)) + + geom_point() +countryset <- c("China","India","Indonesia") +ggplot(gapminder[gapminder$country %in% countryset,], + aes(x = year, y = pop_millions)) + + geom_point() +``` + 
+::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +Comparison operators, logical operators, and many functions are also vectorized: + +**Comparison operators** + +```{r} +x > 2 +``` + +**Logical operators** + +```{r} +a <- x > 3 # or, for clarity, a <- (x > 3) +a +``` + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: some useful functions for logical vectors + +- `any()` returns `TRUE` if _any_ element of a vector is `TRUE`. +- `all()` returns `TRUE` if _all_ elements of a vector are `TRUE`. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +Most functions also operate element-wise on vectors: + +**Functions** + +```{r} +x <- 1:4 +log(x) +``` + +Vectorized operations work element-wise on matrices: + +```{r} +m <- matrix(1:12, nrow=3, ncol=4) +m * -1 +``` + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: element-wise vs. matrix multiplication + +Very important: the operator `*` gives you element-wise multiplication! +To do matrix multiplication, we need to use the `%*%` operator: + +```{r} +m %*% matrix(1, nrow=4, ncol=1) +matrix(1:4, nrow=1) %*% matrix(1:4, ncol=1) +``` + +For more on matrix algebra, see the [Quick-R reference guide](https://www.statmethods.net/advstats/matrix.html). + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 3 + +Given the following matrix: + +```{r} +m <- matrix(1:12, nrow=3, ncol=4) +m +``` + +Write down what you think will happen when you run: + +1. `m ^ -1` +2. `m * c(1, 0, -1)` +3. `m > c(0, 20)` +4. `m * c(1, 0, -1, 2)` + +Did you get the output you expected? If not, ask a helper! + +::::::::::::::: solution + +## Solution to Challenge 3 + +Given the following matrix: + +```{r} +m <- matrix(1:12, nrow=3, ncol=4) +m +``` + +Write down what you think will happen when you run: + +1. `m ^ -1` + +```{r, echo=FALSE} +m ^ -1 +``` + +2. `m * c(1, 0, -1)` + +```{r, echo=FALSE} +m * c(1, 0, -1) +``` + +3. `m > c(0, 20)` + +```{r, echo=FALSE} +m > c(0, 20) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 4 + +We're interested in looking at the sum of the following sequence of fractions: + +```{r, eval=FALSE} + x = 1/(1^2) + 1/(2^2) + 1/(3^2) + ... 
+ 1/(n^2) +``` + +This would be tedious to type out, and impossible for high values of n. Use vectorisation to compute x when n = 100. What is the sum when n = 10,000? + +::::::::::::::: solution + +## Solution to Challenge 4 + +We're interested in looking at the sum of the following sequence of fractions: + +```{r, eval=FALSE} + x = 1/(1^2) + 1/(2^2) + 1/(3^2) + ... + 1/(n^2) +``` + +This would be tedious to type out, and impossible for high values of n. +Use vectorisation to compute x when n = 100. +What is the sum when n = 10,000? + +```{r} +sum(1/(1:100)^2) +sum(1/(1:1e04)^2) +n <- 10000 +sum(1/(1:n)^2) +``` + +We can also obtain the same results using a function: + +```{r} +inverse_sum_of_squares <- function(n) { + sum(1/(1:n)^2) +} +inverse_sum_of_squares(100) +inverse_sum_of_squares(10000) +n <- 10000 +inverse_sum_of_squares(n) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: Operations on vectors of unequal length + +Operations can also be performed on vectors of unequal length, through +a process known as _recycling_. This process automatically repeats the shorter vector +until it matches the length of the longer vector. R will issue a warning +if the length of the longer vector is not a multiple of the length of the shorter one. + +```{r} +x <- c(1, 2, 3) +y <- c(1, 2, 3, 4, 5, 6, 7) +x + y +``` + +Vector `x` was recycled to match the length of vector `y`: + +```{r, eval=FALSE} +x: 1 2 3 1 2 3 1 + + + + + + + + +y: 1 2 3 4 5 6 7 +----------------------- + 2 4 6 5 7 9 8 +``` + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: keypoints + +- Use vectorized operations instead of loops. 
+ +:::::::::::::::::::::::::::::::::::::::::::::::::: diff --git a/locale/ja/episodes/10-functions.Rmd b/locale/ja/episodes/10-functions.Rmd new file mode 100644 index 000000000..c514ca401 --- /dev/null +++ b/locale/ja/episodes/10-functions.Rmd @@ -0,0 +1,501 @@ +--- +title: Functions Explained +teaching: 45 +exercises: 15 +source: Rmd +--- + +::::::::::::::::::::::::::::::::::::::: objectives + +- Define a function that takes arguments. +- Return a value from a function. +- Check argument conditions with `stopifnot()` in functions. +- Test a function. +- Set default values for function arguments. +- Explain why we should divide programs into small, single-purpose functions. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- How can I write a new function in R? + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +```{r, include=FALSE} +gapminder <- read.csv("data/gapminder_data.csv", header = TRUE) +``` + +If we only had one data set to analyze, it would probably be faster to load the file into a spreadsheet and use that to plot simple statistics. However, the gapminder data is updated periodically, and we may want to pull in that new information later and re-run our analysis. We may also obtain similar data from a different source in the future. + +In this lesson, we'll learn how to write a function so that we can repeat several operations with a single command. + +::::::::::::::::::::::::::::::::::::::::: callout + +## What is a function? + +Functions gather a sequence of operations into a whole, preserving it for ongoing use. Functions provide: + +- a name we can remember and invoke it by +- relief from the need to remember the individual operations +- a defined set of inputs and expected outputs +- rich connections to the larger programming environment + +As the basic building block of most programming languages, user-defined +functions constitute "programming" as much as any single abstraction can. 
If you have written a function, you are a computer programmer. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Defining a function + +Let's open a new R script file in the `functions/` directory and call it functions-lesson.R. + +The general structure of a function is: + +```{r} +my_function <- function(parameters) { + # perform action + # return value +} +``` + +Let's define a function `fahr_to_kelvin()` that converts temperatures from Fahrenheit to Kelvin: + +```{r} +fahr_to_kelvin <- function(temp) { + kelvin <- ((temp - 32) * (5 / 9)) + 273.15 + return(kelvin) +} +``` + +We define `fahr_to_kelvin()` by assigning it to the output of `function`. The list of argument names is contained within parentheses. Next, the [body]({}/reference/#function-body) of the function, that is, the statements that are executed when it runs, is contained within curly braces (`{}`). The statements in the body are indented by two spaces. This makes the code easier to read but does not affect how the code operates. + +It is useful to think of creating functions like writing a cookbook. First you define the "ingredients" that your function needs. In this case, we only need one ingredient to use our function: "temp". After we list our ingredients, we then say what we will do with them; in this case, we are taking our ingredient and applying a set of mathematical operators to it. + +When we call the function, the values we pass to it as arguments are assigned to the variables used inside the function. Inside the function, we use a [return statement]({}/reference/#return-statement) to send a result back to whoever asked for it. + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip + +One feature unique to R is that the return statement is not required. +R automatically returns the value of the last expression evaluated in the body of the function. But for clarity, we will explicitly define the return statement. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +Let's try running our function. +Calling our own function is no different from calling any other function: + +```{r} +# freezing point of water +fahr_to_kelvin(32) +``` + +```{r} +# boiling point of water +fahr_to_kelvin(212) +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 1 + +Write a function called `kelvin_to_celsius()` that takes a temperature in +Kelvin and returns that temperature in Celsius. 
+ +Hint: To convert from Kelvin to Celsius you subtract 273.15 + +::::::::::::::: solution + +## Solution to Challenge 1 + +Write a function called `kelvin_to_celsius` that takes a temperature in Kelvin +and returns that temperature in Celsius + +```{r} +kelvin_to_celsius <- function(temp) { + celsius <- temp - 273.15 + return(celsius) +} +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Combining functions + +The real power of functions comes from mixing, matching and combining them into ever-larger chunks to get the effect we want. + +Let's define two functions that will convert temperature from Fahrenheit to Kelvin, and Kelvin to Celsius: + +```{r} +fahr_to_kelvin <- function(temp) { + kelvin <- ((temp - 32) * (5 / 9)) + 273.15 + return(kelvin) +} + +kelvin_to_celsius <- function(temp) { + celsius <- temp - 273.15 + return(celsius) +} +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 2 + +Define the function to convert directly from Fahrenheit to Celsius, +by reusing the two functions above (or using your own functions if you +prefer). + +::::::::::::::: solution + +## Solution to Challenge 2 + +Define the function to convert directly from Fahrenheit to Celsius, +by reusing these two functions above + +```{r} +fahr_to_celsius <- function(temp) { + temp_k <- fahr_to_kelvin(temp) + result <- kelvin_to_celsius(temp_k) + return(result) +} +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Interlude: Defensive Programming + +Now that we've begun to appreciate how writing functions provides an efficient way to make R code re-usable and modular, we should note that it is important to ensure that functions only work in their intended use-cases. Checking function parameters is related to the concept of _defensive programming_. +Defensive programming encourages us to frequently check conditions and throw an error if something is wrong. These checks are referred to as assertion statements because we want to assert that some condition is `TRUE` before proceeding. +They make it easier to debug because they give us a better idea of where the errors originate. + +### Checking conditions with `stopifnot()` + +Let's start by re-examining `fahr_to_kelvin()`, our function for converting temperatures from Fahrenheit to Kelvin. It was defined like so: + +```{r} +fahr_to_kelvin <- function(temp) { + kelvin <- ((temp - 32) * (5 / 9)) + 273.15 + return(kelvin) +} +``` + +For this function to work as intended, the argument `temp` must be a `numeric` +value; otherwise, the mathematical 
procedure for converting between the two +temperature scales will not work. To create an error, we can use the function +`stop()`. For example, since the argument `temp` must be a `numeric` vector, we +could check for this condition with an `if` statement and throw an error if the +condition was violated. We could augment our function above like so: + +```{r} +fahr_to_kelvin <- function(temp) { + if (!is.numeric(temp)) { + stop("temp must be a numeric vector.") + } + kelvin <- ((temp - 32) * (5 / 9)) + 273.15 + return(kelvin) +} +``` + +If you had multiple conditions or arguments to check, it would take many lines of code to check all of them. Luckily R provides the convenience function `stopifnot()`. We can list as many requirements that should evaluate to `TRUE` as we like; `stopifnot()` throws an error if it finds one that is `FALSE`. Listing these conditions also serves a secondary purpose as extra documentation for the function. + +Let's try out defensive programming with `stopifnot()` by adding assertions to check the input to our function `fahr_to_kelvin()`. + +We want to assert that `temp` is a numeric vector. We may do that like so: + +```{r} +fahr_to_kelvin <- function(temp) { + stopifnot(is.numeric(temp)) + kelvin <- ((temp - 32) * (5 / 9)) + 273.15 + return(kelvin) +} +``` + +It still works when given proper input. + +```{r} +# freezing point of water +fahr_to_kelvin(temp = 32) +``` + +But it fails instantly if given improper input. + +```{r} +# Metric is a factor instead of numeric +fahr_to_kelvin(temp = as.factor(32)) +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 3 + +Use defensive programming to ensure that our `fahr_to_celsius()` function throws an error immediately if the argument `temp` is specified inappropriately. + +::::::::::::::: solution + +## Solution to Challenge 3 + +Extend our previous definition of the function by adding an explicit call to `stopifnot()`. Since `fahr_to_celsius()` is a composition of two other functions, checking inside here makes adding checks to the two component functions redundant. + +```{r} +fahr_to_celsius <- function(temp) { + stopifnot(is.numeric(temp)) + temp_k <- fahr_to_kelvin(temp) + result <- kelvin_to_celsius(temp_k) + return(result) +} +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## More on combining functions + +Now, we're going to define a function that calculates the Gross Domestic Product of a nation from the data available in our dataset: + +```{r} +# Takes a dataset and multiplies the population column +# with the GDP per capita column. +calcGDP <- function(dat) { + gdp <- dat$pop * dat$gdpPercap + return(gdp) +} +``` + +`calcGDP()` 
is defined by assigning it to the output of `function`. The list of argument names is contained within parentheses. Next, the body of the function (the statements executed when you call the function) is contained within curly braces (`{}`). + +We've indented the statements in the body by two spaces. This makes the code easier to read but does not affect how it operates. + +When we call the function, the values we pass to it are assigned to the arguments, which become variables inside the body of the function. + +Inside the function, we use the `return()` function to send back the result. +The `return()` function is optional: R will automatically return the result of the last expression evaluated in the body of the function. + +```{r} +calcGDP(head(gapminder)) +``` + +That's not very informative. Let's add some more arguments so we can extract the result per year and per country. + +```{r} +# Takes a dataset and multiplies the population column +# with the GDP per capita column. +calcGDP <- function(dat, year=NULL, country=NULL) { + if(!is.null(year)) { + dat <- dat[dat$year %in% year, ] + } + if (!is.null(country)) { + dat <- dat[dat$country %in% country,] + } + gdp <- dat$pop * dat$gdpPercap + + new <- cbind(dat, gdp=gdp) + return(new) +} +``` + +If you've been writing these functions down into a separate R script (a good idea!), you can load the functions into your R session by using the `source()` function: + +```{r, eval=FALSE} +source("functions/functions-lesson.R") +``` + +Ok, so there's a lot going on in this function now. In plain English, the +function now subsets the provided data by year if the year argument isn't empty, +then subsets the result by country if the country argument isn't empty. Then it +calculates the GDP for whatever subset emerges from the previous two steps. The +function then adds the GDP as a new column to the subsetted data and returns +this as the final result. You can see that the output is much more informative +than a vector of numbers. 
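The steps described above can be checked on a tiny, made-up data frame. This is only a sketch: `toy` and its values are invented for illustration and stand in for `gapminder`, and `calcGDP()` is restated from the lesson so that the block runs on its own.

```r
# Toy data standing in for gapminder (values are made up for illustration)
toy <- data.frame(
  country   = c("A", "A", "B", "B"),
  year      = c(2002, 2007, 2002, 2007),
  pop       = c(10, 12, 100, 110),
  gdpPercap = c(2, 3, 1, 1.5)
)

# Same logic as the lesson's calcGDP(): optional subsetting, then a new gdp column
calcGDP <- function(dat, year = NULL, country = NULL) {
  if (!is.null(year)) {
    dat <- dat[dat$year %in% year, ]
  }
  if (!is.null(country)) {
    dat <- dat[dat$country %in% country, ]
  }
  gdp <- dat$pop * dat$gdpPercap
  new <- cbind(dat, gdp = gdp)
  return(new)
}

all_rows  <- calcGDP(toy)                              # no subsetting: 4 rows
only_2007 <- calcGDP(toy, year = 2007)                 # subset by year: 2 rows
b_2007    <- calcGDP(toy, year = 2007, country = "B")  # both filters: 1 row, gdp = 110 * 1.5
```

Because the subsetting happens before the multiplication, the `gdp` column is only ever computed for the rows that survive both filters.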
+ +Let's take a look at what happens when we specify the year: + +```{r} +head(calcGDP(gapminder, year=2007)) +``` + +Or for a specific country: + +```{r} +calcGDP(gapminder, country="Australia") +``` + +Or both: + +```{r} +calcGDP(gapminder, year=2007, country="Australia") +``` + +Let's walk through the body of the function: + +```{r, eval=FALSE} +calcGDP <- function(dat, year=NULL, country=NULL) { +``` + +Here we've added two arguments, `year` and `country`. We've set _default arguments_ for both as `NULL` using the `=` operator in the function definition. This means that those arguments will take the value `NULL` unless the user specifies otherwise. + +```{r, eval=FALSE} + if(!is.null(year)) { + dat <- dat[dat$year %in% year, ] + } + if (!is.null(country)) { + dat <- dat[dat$country %in% country,] + } +``` + +Here, we check whether each additional argument is `NULL`, and whenever it is not `NULL` we overwrite the dataset stored in `dat` with the subset given by the non-`NULL` argument. + +Building these conditionals into the function makes it more flexible for later. +Now, we can use it to calculate the GDP for: + +- The whole dataset; +- A single year; +- A single country; +- A single combination of year and country. + +Because we use `%in%` rather than `==`, we can also give multiple years or countries to those arguments. + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: Pass by value + +Functions in R almost always make copies of the data to operate on +inside of a function body. When we modify `dat` inside the function +we are modifying the copy of the gapminder dataset stored in `dat`, +not the original variable we gave as the first argument. + +This is called "pass-by-value" and it makes writing code much safer: +you can always be sure that whatever changes you make within the +body of the function, stay inside the body of the function. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: Function scope + +Another important concept is scoping: any variables (or functions!) you +create or modify inside the body of a function only exist for the lifetime +of the function's execution. When we call `calcGDP()`, the variables `dat`, `gdp` and `new` only exist inside the body of the function. Even if we have variables of the same name in our interactive R session, they are not modified in any way when executing a function. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +```{r, eval=FALSE} + gdp <- dat$pop * dat$gdpPercap + new <- cbind(dat, gdp=gdp) + return(new) +} +``` + +Finally, we calculated the GDP on our new subset, and created a new data frame +with that column added. This means that when we call the function later we can see the context for the returned GDP values, which is much better than in our first attempt where we got a vector of numbers. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 4 + +Test out your GDP function by calculating the GDP for New Zealand in 1987. How does this differ from New Zealand's GDP in 1952? + +::::::::::::::: solution + +## Solution to Challenge 4 + +```{r, eval=FALSE} + calcGDP(gapminder, year = c(1952, 1987), country = "New Zealand") +``` + +GDP for New Zealand in 1987: 65050008703 + +GDP for New Zealand in 1952: 21058193787 + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 5 + +The `paste()` function can be used to combine text together, e.g.: + +```{r} +best_practice <- c("Write", "programs", "for", "people", "not", "computers") +paste(best_practice, collapse=" ") +``` + +Write a function called `fence()` that takes two vectors as arguments, called +`text` and `wrapper`, and prints out the text wrapped with the `wrapper`: + +```{r, eval=FALSE} +fence(text=best_practice, wrapper="***") +``` + +_Note:_ the `paste()` function has an argument called `sep`, which specifies +the separator between text. The default is a space: " ". The default for +`paste0()` is no space "". 
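The difference between `sep`, `collapse` and `paste0()` mentioned in the note is easiest to see side by side. The following is a minimal sketch; the `words` vector is made up for illustration.

```r
words <- c("Write", "programs")

with_space <- paste(words[1], words[2])             # default sep = " "
no_space   <- paste0(words[1], words[2])            # paste0() uses no separator
with_dash  <- paste(words[1], words[2], sep = "-")  # custom separator between arguments
joined     <- paste(words, collapse = " ")          # collapse joins the elements of one vector

with_space  # "Write programs"
no_space    # "Writeprograms"
with_dash   # "Write-programs"
joined      # "Write programs"
```

Note that `sep` separates the arguments passed to `paste()`, while `collapse` joins the elements of the resulting vector into a single string.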
+ +::::::::::::::: solution + +## Solution to Challenge 5 + +Write a function called `fence()` that takes two vectors as arguments, +called `text` and `wrapper`, and prints out the text wrapped with the +`wrapper`: + +```{r} +fence <- function(text, wrapper){ + text <- c(wrapper, text, wrapper) + result <- paste(text, collapse = " ") + return(result) +} +best_practice <- c("Write", "programs", "for", "people", "not", "computers") +fence(text=best_practice, wrapper="***") +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip + +R has some unique aspects that can be exploited when performing more complicated operations. We will not be writing anything that requires knowledge of these more advanced concepts. In the future, when you are comfortable writing functions in R, you can learn more by reading the [R Language Manual][man] or this [chapter] from [Advanced R Programming][adv-r] by Hadley Wickham. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: Testing and documenting + +It's important to both test functions and document them: +documentation helps you, and others, understand what the +purpose of your function is and how to use it, and it's +important to make sure that your function actually does +what you think. + +When you first start out, your workflow will probably look a lot +like this: + +1. Write a function +2. Comment parts of the function to document its behaviour +3. Load in the source file +4. Experiment with it in the console to make sure it behaves + as you expect +5. Make any necessary bug fixes +6. Rinse and repeat. + +Formal documentation for functions, written in separate `.Rd` +files, gets turned into the documentation you see in help +files. The [roxygen2] package allows R coders to write documentation +alongside the function code and then process it into the appropriate `.Rd` +files. You will want to switch to this more formal method of writing +documentation when you start writing more complicated R projects. In fact, +packages are, in essence, bundles of functions with this formal documentation. 
+Loading your own functions through `source("functions.R")` is equivalent to +loading someone else's functions (or your own one day!) through +`library("package")`. + +Formal automated tests can be written using the [testthat] package. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +[man]: https://cran.r-project.org/doc/manuals/r-release/R-lang.html#Environment-objects +[chapter]: https://adv-r.had.co.nz/Environments.html +[adv-r]: https://adv-r.had.co.nz/ +[roxygen2]: https://cran.r-project.org/web/packages/roxygen2/vignettes/rd.html +[testthat]: https://r-pkgs.had.co.nz/tests.html + +:::::::::::::::::::::::::::::::::::::::: keypoints + +- Use `function` to define a new function in R. +- Use parameters to pass values into functions. +- Use `stopifnot()` to flexibly check function arguments in R. +- Load functions into programs using `source()`. + +:::::::::::::::::::::::::::::::::::::::::::::::::: diff --git a/locale/ja/episodes/11-writing-data.Rmd b/locale/ja/episodes/11-writing-data.Rmd new file mode 100644 index 000000000..4ac82c25f --- /dev/null +++ b/locale/ja/episodes/11-writing-data.Rmd @@ -0,0 +1,158 @@ +--- +title: Writing Data +teaching: 10 +exercises: 10 +source: Rmd +--- + +::::::::::::::::::::::::::::::::::::::: objectives + +- To be able to write out plots and data from R. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- How can I save plots and data created in R? 
+ +:::::::::::::::::::::::::::::::::::::::::::::::::: + +```{r, include=FALSE} +library("ggplot2") +gapminder <- read.csv("data/gapminder_data.csv", header = TRUE) +dir.create("cleaned-data") +``` + +## Saving plots + +You have already seen how to save the most recent plot you create in `ggplot2`, using the command `ggsave`. As a refresher: + +```{r, eval=FALSE} +ggsave("My_most_recent_plot.pdf") +``` + +You can save a plot from within RStudio using the "Export" button in the "Plot" window. This will give you the option of saving as a .pdf or as .png, .jpg or other image formats. + +Sometimes you will want to save plots without creating them in the "Plot" window first. Perhaps you want to make a pdf document with multiple pages, each one a different plot. Or perhaps you're looping through multiple subsets of a file, plotting data from each subset, and you want to save each plot, but obviously can't stop the loop to click "Export" for each one. + +In this case you can use a more flexible approach. The function `pdf` creates a new pdf device. You can control the size and resolution using the arguments to this function. + +```{r, eval=FALSE} +pdf("Life_Exp_vs_time.pdf", width=12, height=4) +ggplot(data=gapminder, aes(x=year, y=lifeExp, colour=country)) + + geom_line() + + theme(legend.position = "none") + +# You then have to make sure to turn off the pdf device! + +dev.off() +``` + +Open up this document and have a look. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 1 + +Rewrite your 'pdf' command to print a second +page in the pdf, showing a facet plot (hint: use `facet_grid`) +of the same data with one panel per continent. 
+ +::::::::::::::: solution + +## Solution to Challenge 1 + +```{r, eval=FALSE} +pdf("Life_Exp_vs_time.pdf", width = 12, height = 4) +p <- ggplot(data = gapminder, aes(x = year, y = lifeExp, colour = country)) + + geom_line() + + theme(legend.position = "none") +p +p + facet_grid(~continent) +dev.off() +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +The commands `jpeg`, `png`, etc. are used similarly to produce documents in different formats. + +## Writing data + +At some point, you'll also want to write out data from R. + +We can use the `write.table` function for this, which is very similar to `read.table` from before. + +Let's create a data-cleaning script. For this analysis, we only want to focus on the gapminder data for Australia: + +```{r} +aust_subset <- gapminder[gapminder$country == "Australia",] + +write.table(aust_subset, + file="cleaned-data/gapminder-aus.csv", + sep="," +) +``` + +Let's switch back to the shell to take a look at the data to make sure it looks OK: + +```{r, engine="bash"} +head cleaned-data/gapminder-aus.csv +``` + +Hmm, that's not quite what we wanted. Where did all these +quotation marks come from? Also the row numbers are +meaningless. + +Let's look at the help file to work out how to change this behaviour. + +```{r, eval=FALSE} +?write.table +``` + +By default R will wrap character vectors with quotation marks when writing out to file. It will also write out the row and column names. + +Let's fix this: + +```{r} +write.table( + gapminder[gapminder$country == "Australia",], + file="cleaned-data/gapminder-aus.csv", + sep=",", quote=FALSE, row.names=FALSE +) +``` + +Now let's look at the data again using our shell skills: + +```{r, engine="bash"} +head cleaned-data/gapminder-aus.csv +``` + +That looks better! 
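To make the effect of `quote` and `row.names` concrete without touching the lesson's files, here is a minimal sketch that writes a made-up two-row data frame to temporary files and reads the raw lines back. The `toy` data frame and the temporary file names are assumptions for illustration only.

```r
# A tiny stand-in for the Australian subset (values are made up)
toy <- data.frame(country = c("Australia", "Australia"),
                  year = c(2002L, 2007L),
                  stringsAsFactors = FALSE)

default_file <- tempfile(fileext = ".csv")
clean_file   <- tempfile(fileext = ".csv")

# Defaults: character values are quoted and row names are written
write.table(toy, file = default_file, sep = ",")

# Quoting and row names switched off, as in the lesson
write.table(toy, file = clean_file, sep = ",", quote = FALSE, row.names = FALSE)

default_lines <- readLines(default_file)
clean_lines   <- readLines(clean_file)

clean_lines  # "country,year" "Australia,2002" "Australia,2007"
```

Comparing `default_lines` with `clean_lines` shows the quotation marks and the extra row-name field that the two arguments remove.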
+ +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 2 + +Write a data-cleaning script file that subsets the gapminder data to include only data points collected since 1990. + +Use this script to write out the new subset to a file in the `cleaned-data/` directory. + +::::::::::::::: solution + +## Solution to Challenge 2 + +```{r, eval=FALSE} +write.table( + gapminder[gapminder$year > 1990, ], + file = "cleaned-data/gapminder-after1990.csv", + sep = ",", quote = FALSE, row.names = FALSE +) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +```{r, echo=FALSE} +# We remove after rendering the lesson, because we don't want this in the lesson +# repository +unlink("cleaned-data", recursive=TRUE) +``` + +:::::::::::::::::::::::::::::::::::::::: keypoints + +- Save plots from RStudio using the 'Export' button. +- Use `write.table` to save tabular data. + +:::::::::::::::::::::::::::::::::::::::::::::::::: diff --git a/locale/ja/episodes/12-dplyr.Rmd b/locale/ja/episodes/12-dplyr.Rmd new file mode 100644 index 000000000..f9f344b22 --- /dev/null +++ b/locale/ja/episodes/12-dplyr.Rmd @@ -0,0 +1,447 @@ +--- +title: Data Frame Manipulation with dplyr +teaching: 40 +exercises: 15 +source: Rmd +--- + +::::::::::::::::::::::::::::::::::::::: objectives + +- To be able to use the six main data frame manipulation 'verbs' with pipes in `dplyr`. +- To understand how `group_by()` and `summarize()` can be combined to summarize datasets. +- Be able to analyze a subset of data using logical filtering. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- How can I manipulate data frames without repeating myself? 
+ +:::::::::::::::::::::::::::::::::::::::::::::::::: + +```{r, include=FALSE} +gapminder <- read.csv("data/gapminder_data.csv", header = TRUE) +``` + +Manipulation of data frames means many things to many researchers: we often select certain observations (rows) or variables (columns), we often group the data by a certain variable(s), or we even calculate summary statistics. We can +do these operations using the normal base R operations: + +```{r} +mean(gapminder$gdpPercap[gapminder$continent == "Africa"]) +mean(gapminder$gdpPercap[gapminder$continent == "Americas"]) +mean(gapminder$gdpPercap[gapminder$continent == "Asia"]) +``` + +But this isn't very _nice_ because there is a fair bit of repetition. Repeating yourself will cost you time, both now and later, and potentially introduce some nasty bugs. + +## The `dplyr` package + +Luckily, the [`dplyr`](https://cran.r-project.org/package=dplyr) package provides a number of very useful functions for manipulating data frames in a way that will reduce the above repetition, reduce the probability of making errors, and probably even save you some typing. As an added bonus, you might +even find the `dplyr` grammar easier to read. + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: Tidyverse + +The `dplyr` package belongs to a broader family of opinionated R packages +designed for data science called the "Tidyverse". These +packages are specifically designed to work harmoniously together. +Some of these packages will be covered in this course, but you can find more +complete information here: [https://www.tidyverse.org/](https://www.tidyverse.org/). + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +Here we're going to cover five of the most commonly used functions, as well as using pipes (`%>%`) to combine them. + +1. `select()` +2. `filter()` +3. `group_by()` +4. `summarize()` +5. 
`mutate()` + +If you have not installed this package earlier, please do so: + +```{r, eval=FALSE} +install.packages('dplyr') +``` + +Now let's load the package: + +```{r, message=FALSE} +library("dplyr") +``` + +## Using select() + +If, for example, we wanted to move forward with only a few of the variables in our data frame, we could use the `select()` function. This will keep only the variables you select. + +```{r} +year_country_gdp <- select(gapminder, year, country, gdpPercap) +``` + +![](fig/13-dplyr-fig1.png){alt='Diagram illustrating use of select function to select two columns of a data frame'} +We can also remove a single column from the `gapminder` data, for example +the `continent` column: + +```{r} +smaller_gapminder_data <- select(gapminder, -continent) +``` + +If we open up `year_country_gdp` we'll see that it only contains the year, country and gdpPercap. Above we used 'normal' grammar, but the strengths of `dplyr` lie in combining several functions using pipes. Since the pipes grammar is unlike anything we've seen in R before, let's repeat what we've done above using pipes. + +```{r} +year_country_gdp <- gapminder %>% select(year, country, gdpPercap) +``` + +To help you understand why we wrote that in that way, let's walk through it step +by step. First we summon the gapminder data frame and pass it on, using the pipe +symbol `%>%`, to the next step, which is the `select()` function. In this case +we don't specify which data object we use in the `select()` function since it +gets that from the previous pipe. **Fun Fact**: There is a good chance you have +encountered pipes before in the shell. In R, a pipe symbol is `%>%` while in the +shell it is `|` but the concept is the same! + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: Renaming data frame columns in dplyr + +In Chapter 4 we covered how you can rename columns with base R by assigning a value to the output of the `names()` function. +Just like select, this is a bit cumbersome, but thankfully dplyr has a `rename()` function. + +Within a pipeline, the syntax is `rename(new_name = old_name)`. +For example, we may want to rename the gdpPercap column name from our `select()` statement above. 
+ +```{r} +tidy_gdp <- year_country_gdp %>% rename(gdp_per_capita = gdpPercap) + +head(tidy_gdp) +``` + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## filter() の使用 + +欧州のみで、上記を進めたいとしたら、 `select` と `filter` を組み合わせましょう。 + +```{r} +year_country_gdp_euro <- gapminder %>% + filter(continent == "Europe") %>% + select(year, country, gdpPercap) +``` + +If we now want to show life expectancy of European countries but only +for a specific year (e.g., 2007), we can do as below. + +```{r} +europe_lifeExp_2007 <- gapminder %>% + filter(continent == "Europe", year == 2007) %>% + select(country, lifeExp) +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## チャレンジ1 + +Write a single command (which can span multiple lines and includes pipes) that +will produce a data frame that has the African values for `lifeExp`, `country` +and `year`, but not for other Continents. How many rows does your data frame +have and why? + +::::::::::::::: solution + +## チャレンジ3の解答 + +```{r} +year_country_lifeExp_Africa <- gapminder %>% + filter(continent == "Africa") %>% + select(year, country, lifeExp) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +以前行ったように、gapminder データフレームを `filter()` 関数に引き渡し、 フィルターされた バージョンのgapminder データフレームを、 `select()` 関数に引き渡します。 注意: ここでは、操作手順がとても重要です。 まず 'select' を使うと、その前のステップで、大陸の変数が削除されているため、 filter で大陸の変数を見つけることができないことでしょう。 + +## Using group\_by() + +Now, we were supposed to be reducing the error prone repetitiveness of what can +be done with base R, but up to now we haven't done that since we would have to +repeat the above for each continent. Instead of `filter()`, which will only pass +observations that meet your criteria (in the above: `continent=="Europe"`), we +can use `group_by()`, which will essentially use every unique criteria that you +could have used in filter. 
+ +```{r} +str(gapminder) + +str(gapminder %>% group_by(continent)) +``` + +`group_by()` (`grouped_df`)で用いたデータフレームのデータ構造は、もともとの `gapminder` (`data.frame`)とは異なることに気づいたことでしょう。 `grouped_df` は、 `list` のようなものです。その `list` にある各項目は、 (少なくとも上記の例では)特定の `continent` の値が対応する列のみを含む `data.frame` になります。 + +![](fig/13-dplyr-fig2.png){alt='Diagram illustrating how the group by function oraganizes a data frame into groups'} + +## summarize() の使用 + +The above was a bit on the uneventful side but `group_by()` is much more +exciting in conjunction with `summarize()`. This will allow us to create new +variable(s) by using functions that repeat for each of the continent-specific +data frames. That is to say, using the `group_by()` function, we split our +original data frame into multiple pieces, then we can run functions +(e.g. `mean()` or `sd()`) within `summarize()`. + +```{r} +gdp_bycontinents <- gapminder %>% + group_by(continent) %>% + summarize(mean_gdpPercap = mean(gdpPercap)) +``` + +![](fig/13-dplyr-fig3.png){alt='Diagram illustrating the use of group by and summarize together to create a new variable'} + +```{r, eval=FALSE} +continent mean_gdpPercap + +1 Africa 2193.755 +2 Americas 7136.110 +3 Asia 7902.150 +4 Europe 14469.476 +5 Oceania 18621.609 +``` + +これにより、それぞれの大陸の平均gdpPercapを計算することができますが、 更に、すばらしいことがあるのです。 + +::::::::::::::::::::::::::::::::::::::: challenge + +## チャレンジ2 + +Calculate the average life expectancy per country. Which has the longest average life +expectancy and which has the shortest average life expectancy? + +::::::::::::::: solution + +## チャレンジ3の解答 + +```{r} +lifeExp_bycountry <- gapminder %>% + group_by(country) %>% + summarize(mean_lifeExp = mean(lifeExp)) +lifeExp_bycountry %>% + filter(mean_lifeExp == min(mean_lifeExp) | mean_lifeExp == max(mean_lifeExp)) +``` + +Another way to do this is to use the `dplyr` function `arrange()`, which +arranges the rows in a data frame according to the order of one or more +variables from the data frame. 
It has similar syntax to other functions from +the `dplyr` package. You can use `desc()` inside `arrange()` to sort in +descending order. + +```{r} +lifeExp_bycountry %>% + arrange(mean_lifeExp) %>% + head(1) +lifeExp_bycountry %>% + arrange(desc(mean_lifeExp)) %>% + head(1) +``` + +Alphabetical order works too: + +```{r} +lifeExp_bycountry %>% + arrange(desc(country)) %>% + head(1) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +The function `group_by()` also allows us to group by multiple variables. Let's group by `year` and `continent`. + +```{r} +gdp_bycontinents_byyear <- gapminder %>% + group_by(continent, year) %>% + summarize(mean_gdpPercap = mean(gdpPercap)) +``` + +That is already quite powerful, but it gets even better! You're not limited to defining one new variable in `summarize()`. + +```{r} +gdp_pop_bycontinents_byyear <- gapminder %>% + group_by(continent, year) %>% + summarize(mean_gdpPercap = mean(gdpPercap), + sd_gdpPercap = sd(gdpPercap), + mean_pop = mean(pop), + sd_pop = sd(pop)) +``` + +## count() and n() + +A very common operation is to count the number of observations for each +group. The `dplyr` package comes with two related functions that help with this. + +For instance, if we want to check the number of countries included in the dataset for the year 2002, we can use the `count()` function. It takes the name of one or more columns that contain the groups we are interested in, and we can optionally sort the results in descending order by adding `sort=TRUE`: + +```{r} +gapminder %>% + filter(year == 2002) %>% + count(continent, sort = TRUE) +``` + +If we need the number of observations inside a calculation, the `n()` function is useful. It will return the total number of observations in the current group rather than counting the number of observations in each group within a specific column. 
For instance, suppose we want the standard error of the life expectancy per continent: + +```{r} +gapminder %>% + group_by(continent) %>% + summarize(se_le = sd(lifeExp)/sqrt(n())) +``` + +You can also chain together several summary operations; here we calculate the `minimum`, `maximum`, `mean` and `se` of each continent's per-country life expectancy: + +```{r} +gapminder %>% + group_by(continent) %>% + summarize( + mean_le = mean(lifeExp), + min_le = min(lifeExp), + max_le = max(lifeExp), + se_le = sd(lifeExp)/sqrt(n())) +``` + +## Using mutate() + +We can also create new variables prior to (or even after) summarizing information using `mutate()`. + +```{r} +gdp_pop_bycontinents_byyear <- gapminder %>% + mutate(gdp_billion = gdpPercap*pop/10^9) %>% + group_by(continent,year) %>% + summarize(mean_gdpPercap = mean(gdpPercap), + sd_gdpPercap = sd(gdpPercap), + mean_pop = mean(pop), + sd_pop = sd(pop), + mean_gdp_billion = mean(gdp_billion), + sd_gdp_billion = sd(gdp_billion)) +``` + +## Connect mutate with logical filtering: ifelse + +When creating new variables, we can attach a logical condition. A simple combination of `mutate()` and `ifelse()` lets us filter right where it is needed: at the moment of creating something new. +This easy-to-read statement is a fast and useful way of discarding certain data (even though the overall dimension of the data frame does not change) or of updating values depending on the given condition. + +```{r} +## keeping all data but "filtering" after a certain condition +# calculate GDP only for people with a life expectancy above 25 +gdp_pop_bycontinents_byyear_above25 <- gapminder %>% + mutate(gdp_billion = ifelse(lifeExp > 25, gdpPercap * pop / 10^9, NA)) %>% + group_by(continent, year) %>% + summarize(mean_gdpPercap = mean(gdpPercap), + sd_gdpPercap = sd(gdpPercap), + mean_pop = mean(pop), + sd_pop = sd(pop), + mean_gdp_billion = mean(gdp_billion), + sd_gdp_billion = sd(gdp_billion)) + +## updating only if a certain condition is fulfilled +# for life expectancies above 40 years, the GDP to be expected in the future is scaled +gdp_future_bycontinents_byyear_high_lifeExp <- gapminder %>% + mutate(gdp_futureExpectation = ifelse(lifeExp > 40, gdpPercap * 1.5, gdpPercap)) %>% + group_by(continent, year) %>% + summarize(mean_gdpPercap = mean(gdpPercap), + mean_gdpPercap_expected = mean(gdp_futureExpectation)) +``` + +## `dplyr` と `ggplot2` 
の併用 + +First install and load ggplot2: + +```{r, eval=FALSE} +install.packages('ggplot2') +``` + +```{r, message=FALSE} +library("ggplot2") +``` + +In the plotting lesson we looked at how to make a multi-panel figure by adding a layer of facet panels using `ggplot2`. Here is the code we used (with some extra comments): + +```{r} +# Filter countries located in the Americas +americas <- gapminder[gapminder$continent == "Americas", ] +# Make the plot +ggplot(data = americas, mapping = aes(x = year, y = lifeExp)) + + geom_line() + + facet_wrap( ~ country) + + theme(axis.text.x = element_text(angle = 45)) +``` + +This code makes the right plot, but it also creates an intermediate variable (`americas`) that we might not have any other use for. Just as we used `%>%` to pipe data along a chain of `dplyr` functions, we can use it to pass data to `ggplot()`. Because `%>%` replaces the first argument in a function, we don't need to specify the `data =` argument in the `ggplot()` function. By combining `dplyr` and `ggplot2` functions we can make the same figure without creating any new variables or modifying the data. + +```{r} +gapminder %>% + # Filter countries located in the Americas + filter(continent == "Americas") %>% + # Make the plot + ggplot(mapping = aes(x = year, y = lifeExp)) + + geom_line() + + facet_wrap( ~ country) + + theme(axis.text.x = element_text(angle = 45)) +``` + +More examples of using the function `mutate()` and the `ggplot2` package. 
+ +```{r} +gapminder %>% + # extract first letter of country name into new column + mutate(startsWith = substr(country, 1, 1)) %>% + # only keep countries starting with A or Z + filter(startsWith %in% c("A", "Z")) %>% + # plot lifeExp into facets + ggplot(aes(x = year, y = lifeExp, colour = continent)) + + geom_line() + + facet_wrap(vars(country)) + + theme_minimal() +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## 上級チャレンジ + +各大陸から無作為に選ばれた2つの国の2002年の平均余命を計算し、 大陸名を、逆の順番に並べましょう。 ヒント: `dplyr` 関数 `arrange()` 及び `sample_n()` を使いましょう。 +書き方は、他の dplyr 関数と同じです。 + +::::::::::::::: solution + +## Solution to Advanced Challenge + +```{r} +lifeExp_2countries_bycontinents <- gapminder %>% + filter(year==2002) %>% + group_by(continent) %>% + sample_n(2) %>% + summarize(mean_lifeExp=mean(lifeExp)) %>% + arrange(desc(mean_lifeExp)) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## その他役に立つ資料 + +- [R for Data Science](https://r4ds.hadley.nz/) (online book) +- [Data Wrangling Cheat sheet](https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf) (pdf file) +- [Introduction to dplyr](https://dplyr.tidyverse.org/) (online documentation) +- [Data wrangling with R and RStudio](https://www.rstudio.com/resources/webinars/data-wrangling-with-r-and-rstudio/) (online video) +- [Tidyverse Skills for Data Science](https://jhudatascience.org/tidyversecourse/) (online book) + +:::::::::::::::::::::::::::::::::::::::: keypoints + +- Use the `dplyr` package to manipulate data frames. +- Use `select()` to choose variables from a data frame. +- Use `filter()` to choose data based on values. +- Use `group_by()` and `summarize()` to work with subsets of data. +- Use `mutate()` to create new variables. 
+ +:::::::::::::::::::::::::::::::::::::::::::::::::: diff --git a/locale/ja/episodes/13-tidyr.Rmd b/locale/ja/episodes/13-tidyr.Rmd new file mode 100644 index 000000000..6534c0b33 --- /dev/null +++ b/locale/ja/episodes/13-tidyr.Rmd @@ -0,0 +1,304 @@ +--- +title: Data Frame Manipulation with tidyr +teaching: 30 +exercises: 15 +source: Rmd +--- + +::::::::::::::::::::::::::::::::::::::: objectives + +- To understand the concepts of 'longer' and 'wider' data frame formats and be able to convert between them with `tidyr`. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- How can I change the layout of a data frame? + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +```{r, include=FALSE} +gapminder <- read.csv("data/gapminder_data.csv", header = TRUE, stringsAsFactors = FALSE) +gap_wide <- read.csv("data/gapminder_wide.csv", header = TRUE, stringsAsFactors = FALSE) +``` + +研究者には「横長」データを「縦長」データに(又はその逆を)したいと 思うことがよくあります。 「縦長」形式とは: + +- 各列が変数 +- 各行が観測値 + +「縦長」形式では、普通、観測値は1列で、残りの列はIDの変数になります。 + +For the 'wide' format each row is often a site/subject/patient and you have +multiple observation variables containing the same type of data. These can be +either repeated observations over time, or observation of multiple variables (or +a mix of both). You may find data input may be simpler or some other +applications may prefer the 'wide' format. However, many of `R`'s functions have +been designed assuming you have 'longer' formatted data. This tutorial will help you +efficiently transform your data shape regardless of original format. + +![](fig/14-tidyr-fig1.png){alt='Diagram illustrating the difference between a wide versus long layout of a data frame'} + +Long and wide data frame layouts mainly affect readability. For humans, the wide format is often more intuitive since we can often see more of the data on the screen due +to its shape. 
However, the long format is more machine readable and is closer +to the formatting of databases. The ID variables in our data frames are similar to +the fields in a database and observed variables are like the database values. + +## 手始めに + +まず、パッケージをインストールしましょう、もしまだやっていなければですが (おそらく、前の dplyr のレッスンで、インストールしているかと思います): + +```{r, eval=FALSE} +#install.packages("tidyr") +#install.packages("dplyr") +``` + +パーッケージをロードしましょう。 + +```{r, message=FALSE} +library("tidyr") +library("dplyr") +``` + +始めに、そもそもの gapminder データフレームのデータ構造を見てみましょう: + +```{r} +str(gapminder) +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## チャレンジ1 + +gapminder は、横長のみ、縦長のみ、又はその中間の形式でしょうか。 + +::::::::::::::: solution + +## チャレンジ3の解答 + +チャレンジ1の解答 元々の gapminder data.frame は、中間の形式です。 複数の観測変数(`pop`,`lifeExp`,`gdpPercap`)があるため、 縦長のみのデータとは言えません。 + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +Sometimes, as with the gapminder dataset, we have multiple types of observed +data. It is somewhere in between the purely 'long' and 'wide' data formats. We +have 3 "ID variables" (`continent`, `country`, `year`) and 3 "Observation +variables" (`pop`,`lifeExp`,`gdpPercap`). This intermediate format can be +preferred despite not having ALL observations in 1 column given that all 3 +observation variables have different units. There are few operations that would +need us to make this data frame any longer (i.e. 4 ID variables and 1 +Observation variable). + +While using many of the functions in R, which are often vector based, you +usually do not want to do mathematical operations on values with different +units. For example, using the purely long format, a single mean for all of the +values of population, life expectancy, and GDP would not be meaningful since it +would return the mean of values with 3 incompatible units. The solution is that +we first manipulate the data either by grouping (see the lesson on `dplyr`), or +we change the structure of the data frame. 
**Note:** Some plotting functions in +R actually work better in the wide format data. + +## gather() を使って、横長から縦長形式へ + +Until now, we've been using the nicely formatted original gapminder dataset, but +'real' data (i.e. our own research data) will never be so well organized. Here +let's start with the wide formatted version of the gapminder dataset. + +> Download the wide version of the gapminder data from [this link to a csv file](data/gapminder_wide.csv) +> and save it in your data folder. + +We'll load the data file and look at it. データファイルをロードして見てみましょう。 注:大陸と国の列は、因子型にはしたくありません。 そこで、そうならないように、`read.csv()`に stringsAsFactors 引数を使いましょう。 + +```{r} +gap_wide <- read.csv("data/gapminder_wide.csv", stringsAsFactors = FALSE) +str(gap_wide) +``` + +![](fig/14-tidyr-fig2.png){alt='Diagram illustrating the wide format of the gapminder data frame'} + +To change this very wide data frame layout back to our nice, intermediate (or longer) layout, we will use one of the two available `pivot` functions from the `tidyr` package. To convert from wide to a longer format, we will use the `pivot_longer()` function. `pivot_longer()` makes datasets longer by increasing the number of rows and decreasing the number of columns, or 'lengthening' your observation variables into a single variable. + +![](fig/14-tidyr-fig3.png){alt='Diagram illustrating how pivot longer reorganizes a data frame from a wide to long format'} + +```{r} +gap_long <- gap_wide %>% + pivot_longer( + cols = c(starts_with('pop'), starts_with('lifeExp'), starts_with('gdpPercap')), + names_to = "obstype_year", values_to = "obs_values" + ) +str(gap_long) +``` + +ここでは、以前 dplyr のレッスンの中で行ったようなパイプの書き方を使いました。 実は、tidyr と dplyr 関数は互換性があり、パイプで繋ぐことで、一緒に使うことが できるのです。 + +We first provide to `pivot_longer()` a vector of column names that will be +pivoted into longer format. 
We could type out all the observation variables, but +as in the `select()` function (see `dplyr` lesson), we can use the `starts_with()` +argument to select all variables that start with the desired character string. +`pivot_longer()` also allows the alternative syntax of using the `-` symbol to +identify which variables are not to be pivoted (i.e. ID variables). + +The next arguments to `pivot_longer()` are `names_to` for naming the column that +will contain the new ID variable (`obstype_year`) and `values_to` for naming the +new amalgamated observation variable (`obs_value`). We supply these new column +names as strings. + +![](fig/14-tidyr-fig4.png){alt='Diagram illustrating the long format of the gapminder data'} + +```{r} +gap_long <- gap_wide %>% + pivot_longer( + cols = c(-continent, -country), + names_to = "obstype_year", values_to = "obs_values" + ) +str(gap_long) +``` + +これは、このデータフレームでは、取るに足らないことかもしれませんが、 1つの ID 変数と40の変則的な変数名を持つ観測変数がある場合も時にはあります。 柔軟性があることで、かなり時間が節約できるのです! + +Now `obstype_year` actually contains 2 pieces of information, the observation +type (`pop`,`lifeExp`, or `gdpPercap`) and the `year`. We can use the +`separate()` function to split the character strings into multiple variables + +```{r} +gap_long <- gap_long %>% separate(obstype_year, into = c('obs_type', 'year'), sep = "_") +gap_long$year <- as.integer(gap_long$year) +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## チャレンジ2 + +Using `gap_long`, calculate the mean life expectancy, population, and gdpPercap for each continent. +**Hint:** use the `group_by()` and `summarize()` functions we learned in the `dplyr` lesson + +::::::::::::::: solution + +## チャレンジ3の解答 + +```{r} +gap_long %>% group_by(continent, obs_type) %>% + summarize(means=mean(obs_values)) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## spread() で縦長から中間形式へ + +It is always good to check work. 
So, let's use the second `pivot` function, `pivot_wider()`, to 'widen' our observation variables back out. `pivot_wider()` is the opposite of `pivot_longer()`, making a dataset wider by increasing the number of columns and decreasing the number of rows. We can use `pivot_wider()` to pivot or reshape our `gap_long` to the original intermediate format or the widest format. Let's start with the intermediate format. + +The `pivot_wider()` function takes `names_from` and `values_from` arguments. + +To `names_from` we supply the column name whose contents will be pivoted into new +output columns in the widened data frame. The corresponding values will be added +from the column named in the `values_from` argument. + +```{r} +gap_normal <- gap_long %>% + pivot_wider(names_from = obs_type, values_from = obs_values) +dim(gap_normal) +dim(gapminder) +names(gap_normal) +names(gapminder) +``` + +元々の `gapminder` と同じ次元を持つ、中間形式のデータフレーム `gap_normal` ができましたが、 変数の順番が違います。 このふたつが、 `all.equal()` かを調べる前に、これを直しましょう。 + +```{r} +gap_normal <- gap_normal[, names(gapminder)] +all.equal(gap_normal, gapminder) +head(gap_normal) +head(gapminder) +``` + +もうすぐです。元々のは、 `country` 、 `continent` 、そして `year` でソートされていました。 + +```{r} +gap_normal <- gap_normal %>% arrange(country, year) +all.equal(gap_normal, gapminder) +``` + +That's great! すばらしい!一番縦に長い形式から、中間形式に戻し、コードにエラーが でることもありませんでした。 + +Now let's convert the long all the way back to the wide. In the wide format, we +will keep country and continent as ID variables and pivot the observations +across the 3 metrics (`pop`,`lifeExp`,`gdpPercap`) and time (`year`). First we +need to create appropriate labels for all our new variables (time\*metric +combinations) and we also need to unify our ID variables to simplify the process +of defining `gap_wide`. 
+ +```{r} +gap_temp <- gap_long %>% unite(var_ID, continent, country, sep = "_") +str(gap_temp) + +gap_temp <- gap_long %>% + unite(ID_var, continent, country, sep = "_") %>% + unite(var_names, obs_type, year, sep = "_") +str(gap_temp) +``` + +`unite()` を使い、`continent`と`country`を組み合わせ、ID 変数をひとつ作り、 変数名を定義しました。 `spread()` でパイプを使う準備が整いました。 + +```{r} +gap_wide_new <- gap_long %>% + unite(ID_var, continent, country, sep = "_") %>% + unite(var_names, obs_type, year, sep = "_") %>% + pivot_wider(names_from = var_names, values_from = obs_values) +str(gap_wide_new) +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## チャレンジ3 + +この一つ先に進み、 `gap_ludicrously_wide` を作り、国、年及び3つの行列に展開したデータを作りましょう。 +ヒント この新しいデータフレームには、5行しかありません。 + +::::::::::::::: solution + +## チャレンジ3の解答 + +```{r} +gap_ludicrously_wide <- gap_long %>% + unite(var_names, obs_type, year, country, sep = "_") %>% + pivot_wider(names_from = var_names, values_from = obs_values) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +今、とても '横長な' 形式のデータフレームがありますが、 `ID_var` は、より使えるようにできるはずです。 `separate()` を使って、2変数に分けてみましょう。 + +```{r} +gap_wide_betterID <- separate(gap_wide_new, ID_var, c("continent", "country"), sep="_") +gap_wide_betterID <- gap_long %>% + unite(ID_var, continent, country, sep = "_") %>% + unite(var_names, obs_type, year, sep = "_") %>% + pivot_wider(names_from = var_names, values_from = obs_values) %>% + separate(ID_var, c("continent","country"), sep = "_") +str(gap_wide_betterID) + +all.equal(gap_wide, gap_wide_betterID) +``` + +そこにまた戻りました! 
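The round trip above can also be tried on a much smaller table. The following is a minimal sketch, assuming `tidyr` and `dplyr` are installed; `toy_wide`, `toy_long` and `toy_back` are hypothetical names invented for illustration and are not part of the lesson data.

```r
library(tidyr)
library(dplyr)

# Hypothetical toy data in a wide layout (not the gapminder file):
# one ID column and two observation columns named <type>_<year>
toy_wide <- data.frame(
  country  = c("A", "B"),
  pop_2002 = c(10, 20),
  pop_2007 = c(11, 22),
  stringsAsFactors = FALSE
)

# Wide -> long: gather every non-ID column, then split the combined name
toy_long <- toy_wide %>%
  pivot_longer(cols = -country,
               names_to = "obstype_year", values_to = "obs_values") %>%
  separate(obstype_year, into = c("obs_type", "year"), sep = "_")

# Long -> wide: rebuild the combined column name, then spread it back out
toy_back <- toy_long %>%
  unite(var_names, obs_type, year, sep = "_") %>%
  pivot_wider(names_from = var_names, values_from = obs_values)

# Compare against the starting table (pivot_wider() returns a tibble,
# so convert back to a plain data frame first)
all.equal(as.data.frame(toy_back), toy_wide)
```

Working on a two-row table makes it easier to see which columns `pivot_longer()` stacks and which names `pivot_wider()` uses for the new columns before applying the same steps to `gap_wide`.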
+ +## その他役に立つ資料 + +- [R for Data Science](https://r4ds.hadley.nz/) (online book) +- [Data Wrangling Cheat sheet](https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf) (pdf file) +- [Introduction to tidyr](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html) (online documentation) +- [Data wrangling with R and RStudio](https://www.rstudio.com/resources/webinars/data-wrangling-with-r-and-rstudio/) (online video) + +:::::::::::::::::::::::::::::::::::::::: keypoints + +- Use the `tidyr` package to change the layout of data frames. +- Use `pivot_longer()` to go from wide to longer layout. +- Use `pivot_wider()` to go from long to wider layout. + +:::::::::::::::::::::::::::::::::::::::::::::::::: diff --git a/locale/ja/episodes/14-knitr-markdown.Rmd b/locale/ja/episodes/14-knitr-markdown.Rmd new file mode 100644 index 000000000..64c4a9a9c --- /dev/null +++ b/locale/ja/episodes/14-knitr-markdown.Rmd @@ -0,0 +1,431 @@ +--- +title: Producing Reports With knitr +teaching: 60 +exercises: 15 +source: Rmd +--- + +::::::::::::::::::::::::::::::::::::::: objectives + +- Understand the value of writing reproducible reports +- Learn how to recognise and compile the basic components of an R Markdown file +- Become familiar with R code chunks, and understand their purpose, structure and options +- Demonstrate the use of inline chunks for weaving R outputs into text blocks, for example when discussing the results of some calculations +- Be aware of alternative output formats to which an R Markdown file can be exported + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- How can I integrate software and reports? 
+ +:::::::::::::::::::::::::::::::::::::::::::::::::: + +```{r chunk_options, include=FALSE} +``` + +## データ分析報告 + +データ分析家は、協力者や、将来参照する文章として、分析及び結果を記した 報告を数多く書く傾向にあります。 + +Many new users begin by first writing a single R script containing all of their +work, and then share the analysis by emailing the script and various graphs +as attachments. But this can be cumbersome, requiring a lengthy discussion to +explain which attachment was which result. + +Writing formal reports with Word or [LaTeX](https://www.latex-project.org/) +can simplify this process by incorporating both the analysis report and output graphs +into a single document. But tweaking formatting to make figures look correct +and fixing obnoxious page breaks can be tedious and lead to a lengthy "whack-a-mole" +game of fixing new mistakes resulting from a single formatting change. + +Creating a report as a web page (which is an html file) using R Markdown makes things easier. +The report can be one long stream, so tall figures that wouldn't ordinarily fit on +one page can be kept at full size and easier to read, since the reader can simply +keep scrolling. Additionally, the formatting of and R Markdown document is simple and easy to modify, allowing you to spend +more time on your analyses instead of writing reports. + +## 読み書きできるプログラミング + +分析報告書のようなものは、_再現できる_ 文書であることが理想です。つまり、 エラーが見つかった場合や、データに追加があった場合など、単に報告書を 再コンパイルすることで、新しい、または正しい結果が得られるという形です (再現できない文書の場合、図を再作成し、Word文書に貼り付け、更に手作業で 様々な詳細結果に手を加えなえればなりません)。 + +The key R package here is [`knitr`](https://yihui.name/knitr/). It allows you +to create a document that is a mixture of text and chunks of +code. 文書がknitrで処理される際にRコードが実行され、 グラフや他の結果が文書に挿入されます。 + +こういうものを「読み書きできるプログラミング(literate programming)」と呼びます。 + +`knitr` allows you to mix basically any type of text with code from different programming languages, but we recommend that you use `R Markdown`, which mixes Markdown +with R. 
[Markdown](https://www.markdownguide.org/) is a light-weight mark-up language for creating web +pages. + +## R Markdownファイルの作成 + +R Studioで、File New File R Markdown をクリックしましょう。 すると、次のようなダイアログボックスが出ます: + +![](fig/New_R_Markdown.png){alt='Screenshot of the New R Markdown file dialogue box in RStudio'} + +デフォルト(HTML output)のままでよいので、タイトルを付けましょう。 + +## R Markdownの基本構成要素 + +The initial chunk of text (header) contains instructions for R to specify what kind of document will be created, and the options chosen. You can use the header to give your document a title, author, date, and tell it what type of output you want +to produce. In this case, we're creating an html document. + +``` +--- +title: "Initial R Markdown document" +author: "Karl Broman" +date: "April 23, 2015" +output: html_document +--- +``` + +入れたくなければ、これらのフィールドのいずれも消すことができます。 二重引用符は、厳密には _必須_ ではありません。 +大抵は、タイトルにコロンを含めたいときに使います。 + +RStudioは、始めやすいように、例がいくつか入れられてある文書を作成します。 以下のような「チャンク(chunk)」がすでにあるかと思います: + +
+```{r}
+summary(cars)
+```
+
+ +Those are "chunks" of R code that will be executed by knitr and replaced by their results. More on this later. + +## Markdown + +Markdown is a system for writing web pages by marking up the text much +as you would in an email rather than writing html code. The marked-up +text gets _converted_ to html, replacing the marks with the proper +html code. + +For now, let's delete all of the stuff that's there and write a bit of Markdown. + +You make things **bold** using two asterisks, like this: `**bold**`, and you make things _italics_ by using underscores, like this: `_italics_`. + +You can make a bulleted list by writing a list with hyphens or +asterisks with a space between the list and other text, like this: + +``` +* bold with double-asterisks +* italics with underscores +* code-type font with backticks +``` + +or like this: + +``` +- bold with double-asterisks +- italics with underscores +- code-type font with backticks +``` + +Each will appear as: + +- bold with double-asterisks +- italics with underscores +- code-type font with backticks + +You can use whatever method you prefer, but _be consistent_. This maintains the +readability of your code. + +You can make a numbered list by just using numbers. You can even use the same number over and over if you want: + +``` +1. bold with double-asterisks +1. italics with underscores +1. code-type font with backticks +``` + +This will appear as: + +1. bold with double-asterisks +2. italics with underscores +3. code-type font with backticks + +You can make section headers of different sizes by starting a line with some number of `#` symbols: + +``` +# Title +## Main section +### Sub-section +#### Sub-sub section +``` + +You _compile_ the R Markdown document to an html webpage by clicking +the "Knit" button in the upper-left. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 1 + +Create a new R Markdown document. Delete all of the R code chunks +and write a bit of Markdown (some sections, some italicized +text, and an itemized list). + +Convert the document to a webpage. + +::::::::::::::: solution + +## Solution to Challenge 1 + +In RStudio, select File > New file > R Markdown... + +Delete the placeholder text and add the following: + +``` +# Introduction + +## Background on Data + +This report uses the *gapminder* dataset, which has columns that include: + +* country +* continent +* year +* lifeExp +* pop +* gdpPercap + +## Background on Methods + +``` + +Then click the 'Knit' button on the toolbar to generate an html document (webpage). 
+ +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## もうちょっとMarkdownについて + +次のような、ハイパーリンクを作ることができます: `[表示するテキスト](https://carpentries.org/)`. + +You can include an image file like this: `![The Carpentries Logo](https://carpentries.org/assets/img/TheCarpentries.svg)` + +下付き文字 (例 F~2~)は `F~2` で、上付き文字(例 F^2^)は `F^2^` でできます。 + +[LaTeX](http://www.latex-project.org/)の等式の書き方を知っていれば、 `$ $` と `$$ $$` で数式を挿入することができます。 例えば、 `$E = mc^2$` や、 + +``` +$$y = \mu + \sum_{i=1}^p \beta_i x_i + \epsilon$$ +``` + +You can review Markdown syntax by navigating to the +"Markdown Quick Reference" under the "Help" field in the +toolbar at the top of RStudio. + +## Rコードの「チャンク(塊)」 + +The real power of Markdown comes from +mixing markdown with chunks of code. This is R Markdown. When +processed, the R code will be executed; if they produce figures, the +figures will be inserted in the final document. + +メインのコードチャンクは、こんな感じです: + +
+```{r load_data}
+gapminder <- read.csv("gapminder.csv")
+```
+
+ +That is, you place a chunk of R code between \`\`\`{r chunk\_name} +and \`\`\`. そうすると、エラーを修正するときに役立ちますし、グラフのファイル名は、生成されたコードのチャンクの名前に基づいて付けられるからです。 You can create code chunks quickly in RStudio using the shortcuts Ctrl\+Alt\+I on Windows and Linux, or Cmd\+Option\+I on Mac. + +::::::::::::::::::::::::::::::::::::::: challenge + +## チャレンジ2 + +Add code chunks to: + +- Load the ggplot2 package +- Read the gapminder data +- Create a plot + +::::::::::::::: solution + +## チャレンジ3の解答 + +
+```{r load-ggplot2}
+library("ggplot2")
+```
+
+ +
+```{r read-gapminder-data}
+gapminder <- read.csv("gapminder.csv")
+```
+
+ +
+```{r make-plot}
+plot(lifeExp ~ year, data = gapminder)
+```
+
+ +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## どうコンパイルされるか + +「Knit HTML」ボタンを押すと、R Markdown文書は[knitr](http://yihui.name/knitr) によって処理され、単純なMarkdown文書が(一連の図のファイルと共に)生成されます。 Rコードは実行され、入力と出力に置き換えられます。図が生成された場合、 これらの図へのリンクが挿入されます。 + +Markdownと図の文書は、[pandoc](http://pandoc.org/) というツールで処理され、 Markdownファイルは、図の埋め込まれたhtmlファイルに変換されます。 + +```{r rmd_to_html_fig, fig.width=8, fig.height=3, fig.align="left", echo=FALSE} +par(mar=rep(0, 4), bty="n", cex=1.5) +plot(0, 0, type="n", xlab="", ylab="", xaxt="n", yaxt="n", + xlim=c(0, 100), ylim=c(0, 100)) +xw <- 10 +yh <- 35 +xm <- 12 +ym <- 50 +rect(xm-xw/2, ym-yh/2, xm+xw/2, ym+yh/2, lwd=2) +text(xm, ym, ".Rmd") + +xm <- 50 +ym <- 80 +rect(xm-xw/2, ym-yh/2, xm+xw/2, ym+yh/2, lwd=2) +text(xm, ym, ".md") +xm <- 50; ym <- 25 +for(i in c(2, 0, -2)) + rect(xm-xw/2+i, ym-yh/2+i, xm+xw/2+i, ym+yh/2+i, lwd=2, + border="black", col="white") +text(xm-2, ym-2, "figs/") + +xm <- 100-12 +ym <- 50 +rect(xm-xw/2, ym-yh/2, xm+xw/2, ym+yh/2, lwd=2) +text(xm, ym, ".html") + +arrows(22, 50, 38, 50, lwd=2, col="slateblue", len=0.1) +text((22+38)/2, 60, "knitr", col="darkslateblue", cex=1.3) + +arrows(62, 50, 78, 50, lwd=2, col="slateblue", len=0.1) +text((62+78)/2, 60, "pandoc", col="darkslateblue", cex=1.3) +``` + +## チャンクのオプション + +コードのチャンクがどう扱われるかは、様々なオプションによって決められます。 Here are some examples: + +- - コード自体を見せないためには、 `echo=FALSE` を使います +- - 結果を表示しないためには、 `results="hide"` を使います +- - コードを演算せず表示するためには、 `eval=FALSE` を使います +- - 警告やメッセージを隠す場合は、 `warning=FALSE` と `message=FALSE` を使います +- - 生成される図の大きさを(インチで)管理するためには、 `fig.height` と `fig.width` を使います + +なので、書くとすれば: + +
+```{r load_libraries, echo=FALSE, message=FALSE}
+library("dplyr")
+library("ggplot2")
+```
+
+ +使いたいオプションを別のチャンクでも使いたい場合は、 _global_options_ を使います。例えば: + +
+```{r global_options, echo=FALSE}
+knitr::opts_chunk$set(fig.path="Figs/", message=FALSE, warning=FALSE,
+                      echo=FALSE, results="hide", fig.width=11)
+```
+
+ +`fig.path` オプションは、図のファイルがどこに保存されるかを定義します。 ここでの `/` はとても重要で、もしこれがなければ、図は標準的な場所に 保存されますが、 `Figs` で始まる名前のファイルだけになります。 + +共有ディレクトリにR Markdownファイルが複数ある場合は、 `fig.path` を図のファイル名と異なる接頭語を付けるために使うといいかもしれません。 例えば、`fig.path="Figs/cleaning-"` と `fig.path="Figs/analysis-"` とすれば、 それぞれ `"cleaning-"` と `"analysis-"` で始まる図のファイルが `"Figs/"` ディレクトリに保存されます。 + +::::::::::::::::::::::::::::::::::::::: challenge + +## チャレンジ3 + +図の大きさを管理し、コードを隠すチャンクのオプションを使ってみましょう。 + +::::::::::::::: solution + +## チャレンジ3の解答 + +
+```{r echo = FALSE, fig.width = 3}
+plot(faithful)
+```
+
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+You can review all of the `R` chunk options by navigating to
+the "R Markdown Cheat Sheet" under the "Cheatsheets" section
+of the "Help" field in the toolbar at the top of RStudio.
+
+## Inline R code
+
+You can make _every_ number in your report reproducible. Write an in-line code chunk between \`r and \`, like so: ` ``r "r round(some_value, 2)"`` `. The code is executed when the document is compiled and replaced with the _value_ of the result.
+
+Don't let these in-line chunks get split across lines.
+
+Where that is a risk, precede the paragraph with a larger code chunk that does the calculations and defines the variables, with `include=FALSE` set for it (which is the same as combining `echo=FALSE` and `results="hide"`).
+
+Rounding can produce differences in output in such situations. You may want `2.0`, but `round(2.03, 1)` gives just `2`.
+
+The [`myround`](https://github.com/kbroman/broman/blob/master/R/myround.R) function in the [R/broman](https://github.com/kbroman) package handles this.
+
+::::::::::::::::::::::::::::::::::::::: challenge
+
+## Challenge 4
+
+Try out a bit of in-line R code.
+
+::::::::::::::: solution
+
+## Solution to Challenge 4
+
+Here's some inline code to determine that 2 + 2 = `r 2+2`.
+
+:::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+## Other output options
+
+You can also convert R Markdown to a PDF or a Word document. Click the little triangle next to the "Knit HTML" button to get a drop-down menu. Or you could put `pdf_document` or `word_document` in the initial header of the file.
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## Tip: Creating PDF documents
+
+Creating `.pdf` documents may require installation of some extra software. The R
+package `tinytex` provides some tools to help make this process easier for R users.
+With `tinytex` installed, run `tinytex::install_tinytex()` to install the required
+software (you'll only need to do this once) and then when you knit to pdf `tinytex`
+will automatically detect and install any additional LaTeX packages that are needed to
+produce the pdf document. Visit the [tinytex website](https://yihui.org/tinytex/)
+for more information.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## Tip: Visual markdown editing in RStudio
+
+RStudio versions 1.4 and later include visual markdown editing mode.
+In visual editing mode, markdown expressions (like `**bold words**`) are
+transformed to the formatted appearance (**bold words**) as you type.
+This mode also includes a toolbar at the top with basic formatting buttons,
+similar to what you might see in common word processing software programs.
+You can turn visual editing on and off by pressing
+the ![](fig/visual_mode_icon.png){alt='Icon for turning on and off the visual editing mode in RStudio, which looks like a pair of compasses'}
+button in the top right corner of your R Markdown document.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+## Resources
+
+- [Knitr in a knutshell tutorial](https://kbroman.org/knitr_knutshell)
+- [Dynamic Documents with R and knitr](https://www.amazon.com/exec/obidos/ASIN/1482203537/7210-20) (book)
+- [R Markdown documentation](https://rmarkdown.rstudio.com)
+- [R Markdown cheat sheet](https://www.rstudio.com/wp-content/uploads/2016/03/rmarkdown-cheatsheet-2.0.pdf)
+- [Getting started with R Markdown](https://www.rstudio.com/resources/webinars/getting-started-with-r-markdown/)
+- [R Markdown: The Definitive Guide](https://bookdown.org/yihui/rmarkdown/) (book by RStudio team)
+- [Reproducible Reporting](https://www.rstudio.com/resources/webinars/reproducible-reporting/)
+- [The Ecosystem of R Markdown](https://www.rstudio.com/resources/webinars/the-ecosystem-of-r-markdown/)
+- [Introducing Bookdown](https://www.rstudio.com/resources/webinars/introducing-bookdown/)
+
+:::::::::::::::::::::::::::::::::::::::: keypoints
+
+- Mix reporting written in R Markdown with software written in R.
+- Specify chunk options to control formatting.
+- Use `knitr` to convert these documents into PDF and other formats.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
diff --git a/locale/ja/episodes/15-wrap-up.Rmd b/locale/ja/episodes/15-wrap-up.Rmd
new file mode 100644
index 000000000..01495eb91
--- /dev/null
+++ b/locale/ja/episodes/15-wrap-up.Rmd
@@ -0,0 +1,95 @@
+---
+title: Writing Good Software
+teaching: 15
+exercises: 0
+source: Rmd
+---
+
+::::::::::::::::::::::::::::::::::::::: objectives
+
+- Describe best practices for writing R and explain the justification for each.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+:::::::::::::::::::::::::::::::::::::::: questions
+
+- How can I write software that other people can use?
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+## Structure your project folder
+
+Keep your project folder structured, organized and tidy, by creating subfolders for your code files, manuals, data, binaries, output plots, etc. It can be done completely manually, or with the help of RStudio's `New Project` functionality, or a designated package, such as `ProjectTemplate`.
+
+::::::::::::::::::::::::::::::::::::::::: callout
+
+## Tip: ProjectTemplate, a possible solution
+
+One way to automate the management of projects is to install the third-party package, `ProjectTemplate`.
+This package will set up an ideal directory structure for project management.
+This is very useful as it enables you to have your analysis pipeline/workflow organised and structured.
+Together with the default RStudio project functionality and Git you will be able to keep track of your
+work as well as be able to share your work with collaborators.
+
+1. Install `ProjectTemplate`.
+2. Load the library
+3. Initialise the project:
+
+```{r, eval=FALSE}
+install.packages("ProjectTemplate")
+library("ProjectTemplate")
+create.project("../my_project_2", merge.strategy = "allow.non.conflict")
+```
+
+For more information on ProjectTemplate and its functionality visit the
+home page [ProjectTemplate](https://projecttemplate.net/index.html)
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
+
+## Make code readable
+
+The most important part of writing code is making it readable and understandable.
+Someone else must be able to pick up your code and understand what it does. More often than not, this someone will be you six months from now, and if the code is unreadable or hard to follow, you will end up cursing your past self.
+
+## Documentation: tell us why, not how
+
+When you first start out, your comments will often describe what a command does,
+since you're still learning yourself and it can help to clarify concepts and
+remind you later. However, these comments aren't particularly useful later on
+when you don't remember what problem your code is trying to solve. Try to also
+include comments that tell you _why_ you're solving a problem, and _what_ problem
+that is. The _how_ can come after that: it's an implementation detail you ideally
+shouldn't have to worry about.
+
+## Keep your code modular
+
+We recommend keeping your functions separate from your analysis scripts, stored in a separate file that you `source` when you open the R session in your project. This approach is convenient because it leaves you with an uncluttered analysis script and a repository of functions that can be loaded into any analysis script in your project. It also makes it easy to group related functions together.
+
+## Break down problems into bite size pieces
+
+When you first start out, problem solving and function writing can be daunting tasks that are hard to separate from inexperience with code. Break your problem into digestible chunks and worry about the implementation details later: keep splitting the problem into smaller and smaller functions until you reach a point where you can code a solution, then build back up from there.
+
+## Know that your code is doing the right thing
+
+Make sure to test your functions!
+
+## Don't repeat yourself
+
+Functions enable easy reuse within a project. If you see blocks of similar lines of code in your project, those are usually candidates for being moved into functions.
+
+If your calculations are performed through a series of functions, the project becomes more modular and easier to change. This is especially true when a particular input always gives a particular output.
+
+## Remember to be stylish
+
+Apply a consistent style to your code.
+
+:::::::::::::::::::::::::::::::::::::::: keypoints
+
+- Keep your project folder structured, organized and tidy.
+- Document what and why, not how.
+- Break programs into short single-purpose functions.
+- Write re-runnable tests.
+- Don't repeat yourself.
+- Be consistent in naming, indentation, and other aspects of style.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
diff --git a/locale/ja/index.md b/locale/ja/index.md
new file mode 100644
index 000000000..a530df345
--- /dev/null
+++ b/locale/ja/index.md
@@ -0,0 +1,30 @@
+---
+site: sandpaper::sandpaper_site
+---
+
+An introduction to R for non-programmers using gapminder data.
+
+The goal of this lesson is to teach novice programmers to write modular code
+and best practices for using R for data analysis. R is commonly used in many
+scientific disciplines for statistical analysis and its array of third-party
+packages. We find that many scientists who come to Software Carpentry workshops
+use R and want to learn more. The emphasis of these materials is to give
+attendees a strong foundation in the fundamentals of R, and to teach best
+practices for scientific computing: breaking down analyses into modular units,
+task automation, and encapsulation.
+
+Note that this workshop focuses on teaching the fundamentals of the programming language R, and not on teaching statistical analysis.
+
+The lesson contains more material than can be taught in a day. The [instructor notes page]({{ page.root }}/guide) has some suggested lesson plans suitable for a one or one and a half day workshop.
+
+A variety of third-party packages are used throughout this workshop. These are not necessarily the best, nor are they comprehensive, but they are packages we find useful, and they have been chosen primarily for their usability.
+
+:::::::::::::::::::::::::::::::::::::::::: prereq
+
+## Prerequisites
+
+Understand that computers store data and instructions (programs, scripts etc.) in files.
+Understand that files are organized into directories (folders).
+Know how to access files that are not in the working directory by specifying the path.
+
+::::::::::::::::::::::::::::::::::::::::::::::::::
diff --git a/locale/ja/instructors/instructor-notes.md b/locale/ja/instructors/instructor-notes.md
new file mode 100644
index 000000000..e6b480b85
--- /dev/null
+++ b/locale/ja/instructors/instructor-notes.md
@@ -0,0 +1,132 @@
+---
+title: Instructor Notes
+---
+
+## Timing
+
+Leave about 30 minutes at the start of each workshop and another 15 mins
+at the start of each session for technical difficulties like WiFi and
+installing things (even if you asked students to install in advance, longer if
+not).
+
+## Lesson Plans
+
+The lesson contains more material than can be taught in a day.
+Instructors will need to pick an appropriate subset of episodes to use
+in a standard one day course.
+
+Some suggested paths through the material are:
+
+(suggested by [@liz-is](https://github.com/swcarpentry/r-novice-gapminder/issues/104#issuecomment-276529213))
+
+- Introduction to R and RStudio
+- 04 Data Structures
+- 05 Exploring Data Frames ("Realistic example" section onwards)
+- 08 Creating Publication-Quality Graphics with ggplot2
+- 10 Functions Explained
+- 13 Dataframe Manipulation with dplyr
+- 15 Producing Reports With knitr
+
+(suggested by [@naupaka](https://github.com/swcarpentry/r-novice-gapminder/issues/104#issuecomment-312547509))
+
+- Introduction to R and RStudio
+- 02 Project Management With RStudio
+- 03 Seeking Help
+- 04 Data Structures
+- 05 Exploring Data Frames
+- 06 Subsetting Data
+- Vectorization
+- 08 Creating Publication-Quality Graphics with ggplot2 _OR_
+  13 Dataframe Manipulation with dplyr
+- 15 Producing Reports With knitr
+
+A half day course could consist of (suggested by [@karawoo](https://github.com/swcarpentry/r-novice-gapminder/issues/104#issuecomment-277599864)):
+
+- Introduction to R and RStudio
+- 04 Data Structures (only creating vectors with `c()`)
+- 05 Exploring Data Frames ("Realistic example" section onwards)
+- 06 Subsetting Data (excluding factor, matrix and list
subsetting) +- 08 Creating Publication-Quality Graphics with ggplot2 + +## Setting up git in RStudio + +There can be difficulties linking git to RStudio depending on the +operating system and the version of the operating system. To make sure +Git is properly installed and configured, the learners should go to +the Options window in the RStudio application. + +- **Mac OS X:** + - Go RStudio -> Preferences... -> Git/SVN + - Check and see whether there is a path to a file in the "Git executable" window. If not, the next challenge is figuring out where Git is located. + - In the terminal enter `which git` and you will get a path to the git executable. In the "Git executable" window you may have difficulties finding the directory since OS X hides many of the operating system files. While the file selection window is open, pressing "Command-Shift-G" will pop up a text entry box where you will be able to type or paste in the full path to your git executable: e.g. /usr/bin/git or whatever else it might be. +- **Windows:** + - Go Tools -> Global options... -> Git/SVN + - If you use the Software Carpentry Installer, then 'git.exe' should be installed at `C:/Program Files/Git/bin/git.exe`. + +To prevent the learners from having to re-enter their password each time they push a commit to GitHub, this command (which can be run from a bash prompt) will make it so they only have to enter their password once: + +```bash +$ git config --global credential.helper 'cache --timeout=10000000' +``` + +## RStudio Color Preview + +RStudio has a feature to preview the color for certain named colors and hexadecimal colors. This may confuse or distract learners (and instructors) who are not expecting it. 
+ +Mainly, this is likely to come up during the episode on "Data Structures" with the following code block: + +```r +cats <- data.frame(coat = c("calico", "black", "tabby"), + weight = c(2.1, 5.0, 3.2), + likes_string = c(1, 0, 1)) +``` + +This option can be turned off and on in the following menu setting: +Tools -> Global Options -> Code -> Display -> Enable preview of named and hexadecimal colors (under "Syntax") + +## Pulling in Data + +The easiest way to get the data used in this lesson during a workshop is to have +attendees download the raw data from [gapminder-data] and +[gapminder-data-wide]. + +Attendees can use the `File - Save As` dialog in their browser to save the file. + +## Overall + +Make sure to emphasize good practices: put code in scripts, and make +sure they're version controlled. Encourage students to create script +files for challenges. + +If you're working in a cloud environment, get them to upload the +gapminder data after the second lesson. + +Make sure to emphasize that matrices are vectors underneath the hood +and data frames are lists underneath the hood: this will explain a +lot of the esoteric behaviour encountered in basic operations. + +Vector recycling and function stacks are probably best explained +with diagrams on a whiteboard. + +Be sure to actually go through examples of an R help page: help files +can be intimidating at first, but knowing how to read them is tremendously +useful. + +Be sure to show the CRAN task views, look at one of the topics. + +There's a lot of content: move quickly through the earlier lessons. Their +extensiveness is mostly for purposes of learning by osmosis: so that their +memory will trigger later when they encounter a problem or some esoteric behaviour. + +Key lessons to take time on: + +- Data subsetting - conceptually difficult for novices +- Functions - learners especially struggle with this +- Data structures - worth being thorough, but you can go through it quickly. 
+
+Don't worry about being correct or knowing the material back-to-front. Use
+mistakes as teaching moments: the most vital skill you can impart is how to
+debug and recover from unexpected errors.
+
+[gapminder-data]: data/gapminder_data.csv
+[gapminder-data-wide]: data/gapminder_wide.csv
diff --git a/locale/ja/learners/discuss.md b/locale/ja/learners/discuss.md
new file mode 100644
index 000000000..dc9aa7dd7
--- /dev/null
+++ b/locale/ja/learners/discuss.md
@@ -0,0 +1,7 @@
+---
+title: Discussion
+---
+
+Please see [our other R lesson][r-gap] for a different presentation of these concepts.
+
+[r-gap]: https://swcarpentry.github.io/r-novice-gapminder/
diff --git a/locale/ja/learners/reference.md b/locale/ja/learners/reference.md
new file mode 100644
index 000000000..05d27e6ff
--- /dev/null
+++ b/locale/ja/learners/reference.md
@@ -0,0 +1,342 @@
+---
+title: Reference
+---
+
+## Reference
+
+## [Introduction to R and RStudio](episodes/01-rstudio-intro.Rmd)
+
+- Use the escape key to cancel incomplete commands or running code
+  (or Ctrl+C if you're using R from the shell).
+- Basic arithmetic operations follow standard order of precedence:
+  - Brackets: `(`, `)`
+  - Exponents: `^` or `**`
+  - Divide: `/`
+  - Multiply: `*`
+  - Add: `+`
+  - Subtract: `-`
+- Scientific notation is available, e.g: `2e-3`
+- Anything to the right of a `#` is a comment, R will ignore this!
+- Functions are denoted by `function_name()`. Expressions inside the
+  brackets are evaluated before being passed to the function, and
+  functions can be nested.
+- Mathematical functions: `exp`, `sin`, `log`, `log10`, `log2` etc.
+- Comparison operators: `<`, `<=`, `>`, `>=`, `==`, `!=`
+- Use `all.equal` to compare numbers!
+- `<-` is the assignment operator. Anything to the right is evaluated, then
+  stored in a variable named to the left.
+- `ls` lists all variables and functions you've created
+- `rm` can be used to remove them
+- When assigning values to function arguments, you _must_ use `=`.
+ +## [Project management with RStudio](episodes/02-project-intro.Rmd) + +- To create a new project, go to File -> New Project +- Install the `packrat` package to create self-contained projects +- `install.packages` to install packages from CRAN +- `library` to load a package into R +- `packrat::status` to check whether all packages referenced in your + scripts have been installed. + +## [Seeking help](episodes/03-seeking-help.Rmd) + +- To access help for a function type `?function_name` or `help(function_name)` +- Use quotes for special operators e.g. `?"+"` +- Use fuzzy search if you can't remember a name '??search\_term' +- [CRAN task views](https://cran.at.r-project.org/web/views) are a good starting point. +- [Stack Overflow](https://stackoverflow.com/) is a good place to get help with your code. + - `?dput` will dump data you are working from so others can load it easily. + - `sessionInfo()` will give details of your setup that others may need for debugging. + +## [Data structures](episodes/04-data-structures-part1.Rmd) + +Individual values in R must be one of 5 **data types**, multiple values can be grouped in **data structures**. + +**データ型** + +- `typeof(object)` gives information about an items data type. + +- There are 5 main data types: + + - `?numeric` real (decimal) numbers + - `?integer` whole numbers only + - `?character` text + - `?complex` complex numbers + - `?logical` TRUE or FALSE values + + **Special types:** + + - `?NA` missing values + - `?NaN` "not a number" for undefined values (e.g. `0/0`). + - `?Inf`, `-Inf` infinity. + - `?NULL` a data structure that doesn't exist + + `NA` can occur in any atomic vector. `NaN`, and `Inf` can only + occur in complex, integer or numeric type vectors. Atomic vectors + are the building blocks for all other data structures. A `NULL` value + will occur in place of an entire data structure (but can occur as list + elements). 
+ +**Basic data structures in R:** + +- atomic `?vector` (can only contain one type) +- `?list` (containers for other objects) +- `?data.frame` two dimensional objects whose columns can contain different types of data +- `?matrix` two dimensional objects that can contain only one type of data. +- `?factor` vectors that contain predefined categorical data. +- `?array` multi-dimensional objects that can only contain one type of data + +Remember that matrices are really atomic vectors underneath the hood, and that +data.frames are really lists underneath the hood (this explains some of the weirder +behaviour of R). + +**[Vectors](episodes/04-data-structures-part1.Rmd)** + +- `?vector()` All items in a vector must be the same type. +- Items can be converted from one type to another using _coercion_. +- The concatenate function 'c()' will append items to a vector. +- `seq(from=0, to=1, by=1)` will create a sequence of numbers. +- Items in a vector can be named using the `names()` function. + +**[Factors](episodes/04-data-structures-part1.Rmd)** + +- `?factor()` Factors are a data structure designed to store categorical data. +- `levels()` shows the valid values that can be stored in a vector of type factor. + +**[Lists](episodes/04-data-structures-part1.Rmd)** + +- `?list()` Lists are a data structure designed to store data of different types. + +**[Matrices](episodes/04-data-structures-part1.Rmd)** + +- `?matrix()` Matrices are a data structure designed to store 2-dimensional data. + +**[Data Frames](episodes/05-data-structures-part2.Rmd)** + +- `?data.frame` is a key data structure. It is a `list` of `vectors`. +- `cbind()` will add a column (vector) to a data.frame. +- `rbind()` will add a row (list) to a data.frame. + +**Useful functions for querying data structures:** + +- `?str` structure, prints out a summary of the whole data structure +- `?typeof` tells you the type inside an atomic vector +- `?class` what is the data structure? 
+- `?head` print the first `n` elements (rows for two-dimensional objects)
+- `?tail` print the last `n` elements (rows for two-dimensional objects)
+- `?rownames`, `?colnames`, `?dimnames` retrieve or modify the row names
+  and column names of an object.
+- `?names` retrieve or modify the names of an atomic vector or list (or
+  columns of a data.frame).
+- `?length` get the number of elements in an atomic vector
+- `?nrow`, `?ncol`, `?dim` get the dimensions of an n-dimensional object
+  (Won't work on atomic vectors or lists).
+
+## [Exploring Data Frames](episodes/05-data-structures-part2.Rmd)
+
+- `read.csv` to read in data in a regular structure
+  - `sep` argument to specify the separator
+    - "," for comma separated
+    - "\\t" for tab separated
+  - Other arguments:
+    - `header=TRUE` if there is a header row
+
+## [Subsetting data](episodes/06-data-subsetting.Rmd)
+
+- Elements can be accessed by:
+
+  - Index
+  - Name
+  - Logical vectors
+
+- `[` single square brackets:
+
+  - _extract_ single elements or _subset_ vectors
+  - e.g. `x[1]` extracts the first item from vector x.
+  - _extract_ single elements of a list. The returned value will be another `list()`.
+  - _extract_ columns from a data.frame
+
+- `[` with two arguments to:
+
+  - _extract_ rows and/or columns of
+    - matrices
+    - data.frames
+    - e.g. `x[1,2]` will extract the value in row 1, column 2.
+    - e.g. `x[,2]` will extract the entire second column of values.
+
+- `[[` double square brackets to extract items from lists.
+
+- `$` to access columns or list elements by name
+
+- negative indices skip elements
+
+## [Control flow](episodes/07-control-flow.Rmd)
+
+- Use `if` condition to start a conditional statement, `else if` condition to provide
+  additional tests, and `else` to provide a default
+- The bodies of the branches of conditional statements should be indented for readability (R does not require it).
+- Use `==` to test for equality.
+- `%in%` will return a `TRUE`/`FALSE` indicating if there is a match between an element and a vector.
+- `X && Y` is only true if both X and Y are `TRUE`.
+- `X || Y` is true if either X or Y, or both, are `TRUE`.
+- Zero is considered `FALSE`; all other numbers are considered `TRUE`
+- Nest loops to operate on multi-dimensional data.
+
+## [Creating publication quality graphics](episodes/08-plot-ggplot2.Rmd)
+
+- figures can be created with the grammar of graphics:
+  - `library(ggplot2)`
+  - `ggplot` to create the base figure
+  - `aes`thetics specify the data axes, shape, color, and data size
+  - `geom`etry functions specify the type of plot, e.g. `point`, `line`, `density`, `box`
+  - `geom`etry functions also add statistical transforms, e.g. `geom_smooth`
+  - `scale` functions change the mapping from data to aesthetics
+  - `facet` functions stratify the figure into panels
+  - `aes`thetics apply to individual layers, or can be set for the whole plot
+    inside `ggplot`.
+  - `theme` functions change the overall look of the plot
+  - order of layers matters!
+  - `ggsave` to save a figure.
+
+## [Vectorization]({}/09-vectorization/)
+
+- Most functions and operations apply to each element of a vector.
+- `*` applies element-wise to matrices.
+- Use `%*%` for true matrix multiplication.
+- `any()` returns `TRUE` if any element of a vector is `TRUE`.
+- `all()` returns `TRUE` if _all_ elements of a vector are `TRUE`.
+
+## [Functions explained]({}/10-functions/)
+
+- `?"function"`
+- Put code whose parameters change frequently in a function, then call it with
+  different parameter values to customize its behavior.
+- The last line of a function is returned, or you can use `return` explicitly
+- Any code written in the body of the function will preferably look for variables defined inside the function.
+- Document Why, then What, then lastly How (if the code isn't self explanatory)
+
+## [Writing data]({}/11-writing-data/)
+
+- Use `write.table` to write out objects in a regular format.
+- Set `quote=FALSE` so that text isn't wrapped in quotation marks.
+
+## [Dataframe manipulation with dplyr](episodes/12-dplyr.Rmd)
+
+- `library(dplyr)`
+- `?select` to extract variables by name.
+- `?filter` return rows with matching conditions. +- `?group_by` group data by one of more variables. +- `?summarize` summarize multiple values to a single value. +- `?mutate` add new variables to a data.frame. +- Combine operations using the `?"%>%"` pipe operator. + +## [Dataframe manipulation with tidyr](episodes/13-tidyr.Rmd) + +- `library(tidyr)` +- `?pivot_longer` convert data from _wide_ to _long_ format. +- `?pivot_wider` convert data from _long_ to _wide_ format. +- `?separate` split a single value into multiple values. +- `?unite` merge multiple values into a single value. + +## [Producing reports with knitr](episodes/14-knitr-markdown.Rmd) + +- Value of reproducible reports +- Basics of Markdown +- Rコードの「チャンク(塊)」 +- チャンクのオプション +- 文中のRコード +- その他の出力オプション + +## [Best practices for writing good code](episodes/15-wrap-up.Rmd) + +- Program defensively, i.e., assume that errors are going to arise, and write code to detect them when they do. +- Write tests before writing code in order to help determine exactly what that code is supposed to do. +- Know what code is supposed to do before trying to debug it. +- Make it fail every time. +- Make it fail fast. +- Change one thing at a time, and for a reason. +- Keep track of what you've done. +- Be humble + +## Glossary + +[argument]{#argument} +: A value given to a function or program when it runs. +The term is often used interchangeably (and inconsistently) with [parameter](#parameter). + +[assign]{#assign} +: To give a value a name by associating a variable with it. + +[body]{#body} +: (of a function): the statements that are executed when a function runs. + +[comment]{#comment} +: A remark in a program that is intended to help human readers understand what is going on, +but is ignored by the computer. +Comments in Python, R, and the Unix shell start with a `#` character and run to the end of the line; +comments in SQL start with `--`, +and other languages have other conventions. 
+ +[comma-separated values]{#comma-separated-values} +: (CSV) A common textual representation for tables +in which the values in each row are separated by commas. + +[delimiter]{#delimiter} +: A character or characters used to separate individual values, +such as the commas between columns in a [CSV](#comma-separated-values) file. + +[documentation]{#documentation} +: Human-language text written to explain what software does, +how it works, or how to use it. + +[floating-point number]{#floating-point-number} +: A number containing a fractional part and an exponent. +See also: [integer](#integer). + +[for loop]{#for-loop} +: A loop that is executed once for each value in some kind of set, list, or range. +See also: [while loop](#while-loop). + +[index]{#index} +: A subscript that specifies the location of a single value in a collection, +such as a single pixel in an image. + +[integer]{#integer} +: A whole number, such as -12343. See also: [floating-point number](#floating-point-number). + +[library]{#library} +: In R, the directory(ies) where [packages](#package) are stored. + +[package]{#package} +: A collection of R functions, data and compiled code in a well-defined format. Packages are stored in a [library](#library) and loaded using the library() function. + +[parameter]{#parameter} +: A variable named in the function's declaration that is used to hold a value passed into the call. +The term is often used interchangeably (and inconsistently) with [argument](#argument). + +[return statement]{#return-statement} +: A statement that causes a function to stop executing and return a value to its caller immediately. + +[sequence]{#sequence} +: A collection of information that is presented in a specific order. + +[shape]{#shape} +: An array's dimensions, represented as a vector. +For example, a 5×3 array's shape is `(5,3)`. + +[string]{#string} +: Short for "character string", +a [sequence](#sequence) of zero or more characters. 
+
+[syntax error]{#syntax-error}
+: A programming error that occurs when statements are in an order or contain characters
+not expected by the programming language.
+
+[type]{#type}
+: The classification of something in a program (for example, the contents of a variable)
+as a kind of number (e.g. [floating-point number](#floating-point-number), [integer](#integer)), [string](#string),
+or something else. In R the command typeof() is used to query a variable's type.
+
+[while loop]{#while-loop}
+: A loop that keeps executing as long as some condition is true.
+See also: [for loop](#for-loop).
diff --git a/locale/ja/learners/setup.md b/locale/ja/learners/setup.md
new file mode 100644
index 000000000..dc9acf933
--- /dev/null
+++ b/locale/ja/learners/setup.md
@@ -0,0 +1,8 @@
+---
+title: Setup
+---
+
+This lesson assumes you have R and RStudio installed on your computer.
+
+- [Download and install the latest version of R here](https://www.r-project.org/).
+- [Download and install RStudio here](https://www.rstudio.com/products/rstudio/download/#download). RStudio is an application (an integrated development environment or IDE) that facilitates the use of R and offers a number of nice additional features. You will need the Desktop version for your computer.
diff --git a/locale/ja/profiles/learner-profiles.md b/locale/ja/profiles/learner-profiles.md
new file mode 100644
index 000000000..75b2c5cad
--- /dev/null
+++ b/locale/ja/profiles/learner-profiles.md
@@ -0,0 +1,5 @@
+---
+title: FIXME
+---
+
+This is a placeholder file. Please add content here.
diff --git a/locale/uk/CODE_OF_CONDUCT.md b/locale/uk/CODE_OF_CONDUCT.md
new file mode 100644
index 000000000..a820b8df5
--- /dev/null
+++ b/locale/uk/CODE_OF_CONDUCT.md
@@ -0,0 +1,12 @@
+---
+title: Contributor Code of Conduct
+---
+
+As contributors and maintainers of this project,
+we pledge to follow [The Carpentries Code of Conduct][coc].
+ +Instances of abusive, harassing, or otherwise unacceptable behavior +may be reported by following our [reporting guidelines][coc-reporting]. + +[coc-reporting]: https://docs.carpentries.org/topic_folders/policies/incident-reporting.html +[coc]: https://docs.carpentries.org/topic_folders/policies/code-of-conduct.html diff --git a/locale/uk/CONTRIBUTING.md b/locale/uk/CONTRIBUTING.md new file mode 100644 index 000000000..d29e890c5 --- /dev/null +++ b/locale/uk/CONTRIBUTING.md @@ -0,0 +1,122 @@ +## Contributing + +[The Carpentries][cp-site] ([Software Carpentry][swc-site], [Data +Carpentry][dc-site], and [Library Carpentry][lc-site]) are open source +projects, and we welcome contributions of all kinds: new lessons, fixes to +existing material, bug reports, and reviews of proposed changes are all +welcome. + +### Contributor Agreement + +By contributing, you agree that we may redistribute your work under our +license. In exchange, we will address your issues and/or assess +your change proposal as promptly as we can, and help you become a member of our +community. Everyone involved in [The Carpentries][cp-site] agrees to abide by +our [code of conduct](CODE_OF_CONDUCT.md). + +### How to Contribute + +The easiest way to get started is to file an issue to tell us about a spelling +mistake, some awkward wording, or a factual error. This is a good way to +introduce yourself and to meet some of our community members. + +1. If you do not have a [GitHub][github] account, you can [send us comments by + email][contact]. However, we will be able to respond more quickly if you use + one of the other methods described below. + +2. If you have a [GitHub][github] account, or are willing to [create + one][github-join], but do not know how to use Git, you can report problems + or suggest improvements by [creating an issue][repo-issues]. This allows us + to assign the item to someone and to respond to it in a threaded discussion. + +3. 
If you are comfortable with Git, and would like to add or change material, + you can submit a pull request (PR). Instructions for doing this are + [included below](#using-github). For inspiration about changes that need to + be made, check out the [list of open issues][issues] across the Carpentries. + +Note: if you want to build the website locally, please refer to [The Workbench +documentation][template-doc]. + +### Where to Contribute + +1. If you wish to change this lesson, add issues and pull requests here. +2. If you wish to change the template used for workshop websites, please refer + to [The Workbench documentation][template-doc]. + +### What to Contribute + +There are many ways to contribute, from writing new exercises and improving +existing ones to updating or filling in the documentation and submitting [bug +reports][issues] about things that do not work, are not clear, or are missing. +If you are looking for ideas, please see [the list of issues for this +repository][repo-issues], or the issues for [Data Carpentry][dc-issues], +[Library Carpentry][lc-issues], and [Software Carpentry][swc-issues] projects. + +Comments on issues and reviews of pull requests are just as welcome: we are +smarter together than we are on our own. **Reviews from novices and newcomers +are particularly valuable**: it's easy for people who have been using these +lessons for a while to forget how impenetrable some of this material can be, so +fresh eyes are always welcome. + +### What _Not_ to Contribute + +Our lessons already contain more material than we can cover in a typical +workshop, so we are usually _not_ looking for more concepts or tools to add to +them. As a rule, if you want to introduce a new idea, you must (a) estimate how +long it will take to teach and (b) explain what you would take out to make room +for it. The first encourages contributors to be honest about requirements; the +second, to think hard about priorities. 
+ +We are also not looking for exercises or other material that only run on one +platform. Our workshops typically contain a mixture of Windows, macOS, and +Linux users; in order to be usable, our lessons must run equally well on all +three. + +### Using GitHub + +If you choose to contribute via GitHub, you may want to look at [How to +Contribute to an Open Source Project on GitHub][how-contribute]. In brief, we +use [GitHub flow][github-flow] to manage changes: + +1. Create a new branch in your desktop copy of this repository for each + significant change. +2. Commit the change in that branch. +3. Push that branch to your fork of this repository on GitHub. +4. Submit a pull request from that branch to the [upstream repository][repo]. +5. If you receive feedback, make changes on your desktop and push to your + branch on GitHub: the pull request will update automatically. + +NB: The published copy of the lesson is usually in the `main` branch. + +Each lesson has a team of maintainers who review issues and pull requests or +encourage others to do so. The maintainers are community volunteers, and have +final say over what gets merged into the lesson. + +### Other Resources + +The Carpentries is a global organisation with volunteers and learners all over +the world. We share values of inclusivity and a passion for sharing knowledge, +teaching and learning. There are several ways to connect with The Carpentries +community listed at \ including via social +media, slack, newsletters, and email lists. You can also [reach us by +email][contact]. 
+ +[repo]: https://github.com/swcarpentry/r-novice-gapminder +[repo-issues]: https://github.com/swcarpentry/r-novice-gapminder/issues +[contact]: mailto:team@carpentries.org +[cp-site]: https://carpentries.org/ +[dc-issues]: https://github.com/issues?q=user%3Adatacarpentry +[dc-lessons]: https://datacarpentry.org/lessons/ +[dc-site]: https://datacarpentry.org/ +[discuss-list]: https://lists.software-carpentry.org/listinfo/discuss +[github]: https://github.com +[github-flow]: https://guides.github.com/introduction/flow/ +[github-join]: https://github.com/join +[how-contribute]: https://egghead.io/courses/how-to-contribute-to-an-open-source-project-on-github +[issues]: https://carpentries.org/help-wanted-issues/ +[lc-issues]: https://github.com/issues?q=user%3ALibraryCarpentry +[swc-issues]: https://github.com/issues?q=user%3Aswcarpentry +[swc-lessons]: https://software-carpentry.org/lessons/ +[swc-site]: https://software-carpentry.org/ +[lc-site]: https://librarycarpentry.org/ +[template-doc]: https://carpentries.github.io/workbench/ diff --git a/locale/uk/LICENSE.md b/locale/uk/LICENSE.md new file mode 100644 index 000000000..513ad8f83 --- /dev/null +++ b/locale/uk/LICENSE.md @@ -0,0 +1,79 @@ +--- +title: Licenses +--- + +## Instructional Material + +All Carpentries (Software Carpentry, Data Carpentry, and Library Carpentry) +instructional material is made available under the [Creative Commons +Attribution license][cc-by-human]. The following is a human-readable summary of +(and not a substitute for) the [full legal text of the CC BY 4.0 +license][cc-by-legal]. + +You are free: + +- to **Share**---copy and redistribute the material in any medium or format +- to **Adapt**---remix, transform, and build upon the material + +for any purpose, even commercially. + +The licensor cannot revoke these freedoms as long as you follow the license +terms. 
+ +Under the following terms: + +- **Attribution**---You must give appropriate credit (mentioning that your work + is derived from work that is Copyright (c) The Carpentries and, where + practical, linking to \), provide a [link to the + license][cc-by-human], and indicate if changes were made. You may do so in + any reasonable manner, but not in any way that suggests the licensor endorses + you or your use. + +- **No additional restrictions**---You may not apply legal terms or + technological measures that legally restrict others from doing anything the + license permits. With the understanding that: + +Notices: + +- You do not have to comply with the license for elements of the material in + the public domain or where your use is permitted by an applicable exception + or limitation. +- No warranties are given. The license may not give you all of the permissions + necessary for your intended use. For example, other rights such as publicity, + privacy, or moral rights may limit how you use the material. + +## Software + +Except where otherwise noted, the example programs and other software provided +by The Carpentries are made available under the [OSI][osi]-approved [MIT +license][mit-license]. + +Permission is hereby granted, free of charge, to any person obtaining a copy of +this software and associated documentation files (the "Software"), to deal in +the Software without restriction, including without limitation the rights to +use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies +of the Software, and to permit persons to whom the Software is furnished to do +so, subject to the following conditions: + +The above copyright notice and this permission notice shall be included in all +copies or substantial portions of the Software. + +THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE +AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +SOFTWARE. + +## Trademark + +"The Carpentries", "Software Carpentry", "Data Carpentry", and "Library +Carpentry" and their respective logos are registered trademarks of [Community +Initiatives][ci]. + +[cc-by-human]: https://creativecommons.org/licenses/by/4.0/ +[cc-by-legal]: https://creativecommons.org/licenses/by/4.0/legalcode +[mit-license]: https://opensource.org/licenses/mit-license.html +[ci]: https://communityin.org/ +[osi]: https://opensource.org diff --git a/locale/uk/README.md b/locale/uk/README.md new file mode 100644 index 000000000..19341cede --- /dev/null +++ b/locale/uk/README.md @@ -0,0 +1,6 @@ +# Internationalisation hub repository for Software Carpentry R for Reproducible Scientific Analysis + +An introduction to R for non-programmers using the [Gapminder][gapminder] data. +Please see [https://swcarpentry.github.io/r-novice-gapminder](https://swcarpentry.github.io/r-novice-gapminder) for a rendered version of this material in English. + +More info to follow. diff --git a/locale/uk/config.yaml b/locale/uk/config.yaml new file mode 100644 index 000000000..e8310df81 --- /dev/null +++ b/locale/uk/config.yaml @@ -0,0 +1,71 @@ +#------------------------------------------------------------ +#Values for this lesson. +#------------------------------------------------------------ +#Which carpentry is this (swc, dc, lc, or cp)? +#swc: Software Carpentry +#dc: Data Carpentry +#lc: Library Carpentry +#cp: Carpentries (to use for instructor training for instance) +#incubator: The Carpentries Incubator +carpentry: 'swc' +#Overall title for pages. 
+title: 'R for Reproducible Scientific Analysis' +#Date the lesson was created (YYYY-MM-DD, this is empty by default) +created: '2015-04-18' +#Comma-separated list of keywords for the lesson +keywords: 'software, data, lesson, The Carpentries' +#Life cycle stage of the lesson +#possible values: pre-alpha, alpha, beta, stable +life_cycle: 'stable' +#License of the lesson materials (recommended CC-BY 4.0) +license: 'CC-BY 4.0' +#Link to the source repository for this lesson +source: 'https://github.com/swcarpentry/r-novice-gapminder' +#Default branch of your lesson +branch: 'main' +#Who to contact if there are any issues +contact: 'team@carpentries.org' +#Navigation ------------------------------------------------ +#Use the following menu items to specify the order of +#individual pages in each dropdown section. Leave blank to +#include all pages in the folder. +#Example ------------- +#episodes: +#- introduction.md +#- first-steps.md +#learners: +#- setup.md +#instructors: +#- instructor-notes.md +#profiles: +#- one-learner.md +#- another-learner.md +#Order of episodes in your lesson +episodes: + - 01-rstudio-intro.Rmd + - 02-project-intro.Rmd + - 03-seeking-help.Rmd + - 04-data-structures-part1.Rmd + - 05-data-structures-part2.Rmd + - 06-data-subsetting.Rmd + - 07-control-flow.Rmd + - 08-plot-ggplot2.Rmd + - 09-vectorization.Rmd + - 10-functions.Rmd + - 11-writing-data.Rmd + - 12-dplyr.Rmd + - 13-tidyr.Rmd + - 14-knitr-markdown.Rmd + - 15-wrap-up.Rmd +#Information for Learners +learners: +#Information for Instructors +instructors: +#Learner Profiles +profiles: +#Customisation --------------------------------------------- +#This space below is where custom yaml items (e.g. 
pinning +#sandpaper and varnish versions) should live +url: 'https://swcarpentry.github.io/r-novice-gapminder' +analytics: carpentries +lang: en diff --git a/locale/uk/episodes/01-rstudio-intro.Rmd b/locale/uk/episodes/01-rstudio-intro.Rmd new file mode 100644 index 000000000..3949440a1 --- /dev/null +++ b/locale/uk/episodes/01-rstudio-intro.Rmd @@ -0,0 +1,722 @@ +--- +title: Introduction to R and RStudio +teaching: 45 +exercises: 10 +source: Rmd +--- + +::::::::::::::::::::::::::::::::::::::: objectives + +- Describe the purpose and use of each pane in RStudio +- Locate buttons and options in RStudio +- Define a variable +- Assign data to a variable +- Manage a workspace in an interactive R session +- Use mathematical and comparison operators +- Call functions +- Manage packages + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- How to find your way around RStudio? +- How to interact with R? +- How to manage your environment? +- How to install packages? + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +```{r, include=FALSE} +``` + +## Before Starting The Workshop + +Please ensure you have the latest version of R and RStudio installed on your machine. This is important, as some packages used in the workshop may not install correctly (or at all) if R is not up to date. + +- [Download and install the latest version of R here](https://www.r-project.org/) +- [Download and install RStudio here](https://www.rstudio.com/products/rstudio/download/#download) + +## Why use R and R studio? + +Welcome to the R portion of the Software Carpentry workshop! + +Science is a multi-step process: once you've designed an experiment and collected +data, the real fun begins with analysis! Throughout this lesson, we're going to teach you some of the fundamentals of the R language as well as some best practices for organizing code for scientific projects that will make your life easier. 
+ +Although we could use a spreadsheet in Microsoft Excel or Google Sheets to analyze our data, these tools are limited in their flexibility and accessibility. Critically, they also make it difficult to share the steps used to explore and change the raw data, which is key to ["reproducible" research](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003285). + +Therefore, this lesson will teach you how to begin exploring your data using R and RStudio. The R program is available for Windows, Mac, and Linux operating systems, and is freely available from where you downloaded it above. To run R, all you need is the R program. + +However, to make using R easier, we will use the program RStudio, which we also downloaded above. RStudio is a free, open-source, Integrated Development +Environment (IDE). It provides a built-in editor, works on all platforms (including +on servers) and provides many advantages such as integration with version +control and project management. + +## Overview + +We will begin with raw data, perform exploratory analyses, and learn how to plot results graphically. This example starts with a dataset from [gapminder.org](https://www.gapminder.org) containing population information for many +countries through time. Can you read the data into R? Can you plot the population for +Senegal? Can you calculate the average income for countries on the continent of Asia? +By the end of these lessons you will be able to do things like plot the populations +for all of these countries in under a minute! + +**Basic layout** + +When you first open RStudio, you will be greeted by three panels: + +- The interactive R console/Terminal (entire left) +- Environment/History/Connections (tabbed in upper right) +- Files/Plots/Packages/Help/Viewer (tabbed in lower right) + +![](fig/01-rstudio.png){alt='RStudio layout'} + +Once you open files, such as R scripts, an editor panel will also open +in the top left.
+ +![](fig/01-rstudio-script.png){alt='RStudio layout with .R file open'} + +::::::::::::::::::::::::::::::::::::::::: callout + +## R scripts + +Any commands that you write in the R console can be saved to a file +to be re-run later. Files containing R code to be run in this way are +called R scripts. R scripts have `.R` at the end of their names to +let you know what they are. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Workflow within RStudio + +There are two main ways one can work within RStudio: + +1. Test and play within the interactive R console then copy code into + a .R file to run later. + +- This works well when doing small tests and initially starting off. +- It quickly becomes laborious. + +2. Start writing in a .R file and use RStudio's shortcut keys for the Run command + to push the current line, selected lines or modified lines to the + interactive R console. + +- This is a great way to start; all your code is saved for later +- You will be able to run the file you create from within RStudio + or using R's `source()` function. + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: Running segments of your code + +RStudio offers you great flexibility in running code from within the editor +window. There are buttons, menu choices, and keyboard shortcuts. To run the +current line, you can + +1. click on the `Run` button above the editor panel, or +2. select "Run Lines" from the "Code" menu, or +3. hit Ctrl\+Return in Windows or Linux + or Command\+Return on macOS. + (This shortcut can also be seen by hovering + the mouse over the button). To run a block of code, select it and then `Run`. + If you have modified a line of code within a block of code you have just run, + there is no need to reselect the section and `Run`; you can use the next button + along, `Re-run the previous region`. This will run the previous code block + including the modifications you have made.
+ +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Introduction to R + +Much of your time in R will be spent in the R interactive +console. This is where you will run all of your code, and can be a +useful environment to try out ideas before adding them to an R script +file. This console in RStudio is the same as the one you would get if +you typed in `R` in your command-line environment. + +The first thing you will see in the R interactive session is a bunch +of information, followed by a ">" and a blinking cursor. In many ways +this is similar to the shell environment you learned about during the +shell lessons: it operates on the same idea of a "Read, evaluate, +print loop": you type in commands, R tries to execute them, and then +returns a result. + +## Using R as a calculator + +The simplest thing you could do with R is to do arithmetic: + +```{r} +1 + 100 +``` + +And R will print out the answer, with a preceding "[1]". [1] is the index of +the first element of the line being printed in the console. For more information +on indexing vectors, see [Episode 6: Subsetting Data](https://swcarpentry.github.io/r-novice-gapminder/06-data-subsetting/index.html). + +If you type in an incomplete command, R will wait for you to +complete it. If you are familiar with Unix Shell's bash, you may recognize this behavior from bash. + +```r +> 1 + +``` + +```output ++ +``` + +Any time you hit return and the R session shows a "+" instead of a ">", it +means it's waiting for you to complete the command. If you want to cancel +a command you can hit Esc and RStudio will give you back the ">" prompt. + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: Canceling commands + +If you're using R from the command line instead of from within RStudio, +you need to use Ctrl\+C instead of Esc +to cancel the command. This applies to Mac users as well! 
+ +Canceling a command isn't only useful for killing incomplete commands: +you can also use it to tell R to stop running code (for example if it's +taking much longer than you expect), or to get rid of the code you're +currently writing. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +When using R as a calculator, the order of operations is the same as you +would have learned back in school. + +From highest to lowest precedence: + +- Parentheses: `(`, `)` +- Exponents: `^` or `**` +- Multiply: `*` +- Divide: `/` +- Add: `+` +- Subtract: `-` + +```{r} +3 + 5 * 2 +``` + +Use parentheses to group operations in order to force the order of +evaluation if it differs from the default, or to make clear what you +intend. + +```{r} +(3 + 5) * 2 +``` + +This can get unwieldy when not needed, but clarifies your intentions. +Remember that others may later read your code. + +```{r, eval=FALSE} +(3 + (5 * (2 ^ 2))) # hard to read +3 + 5 * 2 ^ 2 # clear, if you remember the rules +3 + 5 * (2 ^ 2) # if you forget some rules, this might help +``` + +The text after each line of code is called a +"comment". Anything that follows after the hash (or octothorpe) symbol +`#` is ignored by R when it executes code. + +Really small or large numbers get a scientific notation: + +```{r} +2/10000 +``` + +Which is shorthand for "multiplied by `10^XX`". So `2e-4` +is shorthand for `2 * 10^(-4)`. + +You can write numbers in scientific notation too: + +```{r} +5e3 # Note the lack of minus here +``` + +## Mathematical functions + +R has many built in mathematical functions. To call a function, +we can type its name, followed by open and closing parentheses. +Functions take arguments as inputs, anything we type inside the parentheses of a function is considered an argument. Depending on the function, the number of arguments can vary from none to multiple. 
For example: + +```{r, eval=FALSE} +getwd() #returns an absolute filepath +``` + +doesn't require an argument, whereas for the next set of mathematical functions we will need to supply the function a value in order to compute the result. + +```{r} +sin(1) # trigonometry functions +``` + +```{r} +log(1) # natural logarithm +``` + +```{r} +log10(10) # base-10 logarithm +``` + +```{r} +exp(0.5) # e^(1/2) +``` + +Don't worry about trying to remember every function in R. You +can look them up on Google, or if you can remember the +start of the function's name, use the tab completion in RStudio. + +This is one advantage that RStudio has over R on its own, it +has auto-completion abilities that allow you to more easily +look up functions, their arguments, and the values that they +take. + +Typing a `?` before the name of a command will open the help page +for that command. When using RStudio, this will open the 'Help' pane; +if using R in the terminal, the help page will open in your browser. +The help page will include a detailed description of the command and +how it works. Scrolling to the bottom of the help page will usually +show a collection of code examples which illustrate command usage. +We'll go through an example later. + +## Comparing things + +We can also do comparisons in R: + +```{r} +1 == 1 # equality (note two equals signs, read as "is equal to") +``` + +```{r} +1 != 2 # inequality (read as "is not equal to") +``` + +```{r} +1 < 2 # less than +``` + +```{r} +1 <= 1 # less than or equal to +``` + +```{r} +1 > 0 # greater than +``` + +```{r} +1 >= -9 # greater than or equal to +``` + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: Comparing Numbers + +A word of warning about comparing numbers: you should +never use `==` to compare two numbers unless they are +integers (a data type which can specifically represent +only whole numbers). 
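A minimal sketch of why this matters (an illustration added here, with values chosen only for the example): the sum `0.1 + 0.2` prints as `0.3`, yet it is not stored as exactly the same floating-point number as `0.3`, so `==` reports a difference while `all.equal()` compares within a small tolerance.

```r
# Two numbers that print the same but differ in their last binary digits
a <- 0.1 + 0.2
b <- 0.3

print(a)                  # displayed with the default 7 significant digits
a == b                    # FALSE: the stored values are not identical
isTRUE(all.equal(a, b))   # TRUE: equal within numeric tolerance
```

Wrapping `all.equal()` in `isTRUE()` is a common pattern, because `all.equal()` returns a description of the difference rather than `FALSE` when the values do not match.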
+ +Computers may only represent decimal numbers with a +certain degree of precision, so two numbers which look +the same when printed out by R, may actually have +different underlying representations and therefore be +different by a small margin of error (called Machine +numeric tolerance). + +Instead you should use the `all.equal` function. + +Further reading: [http://floating-point-gui.de/](https://floating-point-gui.de/) + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Variables and assignment + +We can store values in variables using the assignment operator `<-`, like this: + +```{r} +x <- 1/40 +``` + +Notice that assignment does not print a value. Instead, we stored it for later +in something called a **variable**. `x` now contains the **value** `0.025`: + +```{r} +x +``` + +More precisely, the stored value is a _decimal approximation_ of +this fraction called a [floating point number](https://en.wikipedia.org/wiki/Floating_point). + +Look for the `Environment` tab in the top right panel of RStudio, and you will see that `x` and its value +have appeared. Our variable `x` can be used in place of a number in any calculation that expects a number: + +```{r} +log(x) +``` + +Notice also that variables can be reassigned: + +```{r} +x <- 100 +``` + +`x` used to contain the value 0.025 and now it has the value 100. + +Assignment values can contain the variable being assigned to: + +```{r} +x <- x + 1 #notice how RStudio updates its description of x on the top right tab +y <- x * 2 +``` + +The right hand side of the assignment can be any valid R expression. +The right hand side is _fully evaluated_ before the assignment occurs. + +Variable names can contain letters, numbers, underscores and periods but no spaces. They +must start with a letter or a period followed by a letter (they cannot start with a number nor an underscore). +Variables beginning with a period are hidden variables. 
+Different people use different conventions for long variable names, these include + +- periods.between.words +- underscores\_between\_words +- camelCaseToSeparateWords + +What you use is up to you, but **be consistent**. + +It is also possible to use the `=` operator for assignment: + +```{r} +x = 1/40 +``` + +But this is much less common among R users. The most important thing is to +**be consistent** with the operator you use. There are occasionally places +where it is less confusing to use `<-` than `=`, and it is the most common +symbol used in the community. So the recommendation is to use `<-`. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 1 + +Which of the following are valid R variable names? + +```{r, eval=FALSE} +min_height +max.height +_age +.mass +MaxLength +min-length +2widths +celsius2kelvin +``` + +::::::::::::::: solution + +## Solution to challenge 1 + +The following can be used as R variables: + +```{r ch1pt1-sol, eval=FALSE} +min_height +max.height +MaxLength +celsius2kelvin +``` + +The following creates a hidden variable: + +```{r ch1pt2-sol, eval=FALSE} +.mass +``` + +The following will not be able to be used to create a variable + +```{r ch1pt3-sol, eval=FALSE} +_age +min-length +2widths +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Vectorization + +One final thing to be aware of is that R is _vectorized_, meaning that +variables and functions can have vectors as values. In contrast to physics and +mathematics, a vector in R describes a set of values in a certain order of the +same data type. For example: + +```{r} +1:5 +2^(1:5) +x <- 1:5 +2^x +``` + +This is incredibly powerful; we will discuss this further in an +upcoming lesson. + +## Managing your environment + +There are a few useful commands you can use to interact with the R session. 
+ +`ls` will list all of the variables and functions stored in the global environment +(your working R session): + +```{r, eval=FALSE} +ls() +``` + +```{r, echo=FALSE} +# If `ls()` is left to run by itself when rendering this Rmd document (as would +# happen if the code chunk above was evaluated), the output would contain extra +# items ("args", "dest_md", "op", "src_md") that people following the lesson +# would not see in their own session. +# +# This probably comes from the way the md episodes are generated when the +# lesson website is built. The solution below uses a temporary environment to +# mimick what the learners should observe when running `ls()` on their +# machines. + +temp.env <- new.env() +temp.env$x <- x +temp.env$y <- y +ls(temp.env) +rm(temp.env) +``` + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: hidden objects + +Like in the shell, `ls` will hide any variables or functions starting +with a "." by default. To list all objects, type `ls(all.names=TRUE)` +instead + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +Note here that we didn't give any arguments to `ls`, but we still +needed to give the parentheses to tell R to call the function. + +If we type `ls` by itself, R prints a bunch of code instead of a listing of objects. + +```{r} +ls +``` + +What's going on here? + +Like everything in R, `ls` is the name of an object, and entering the name of +an object by itself prints the contents of the object. The object `x` that we +created earlier contains `r x`: + +```{r} +x +``` + +The object `ls` contains the R code that makes the `ls` function work! We'll talk +more about how functions work and start writing our own later. 
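The difference between calling `ls()` and naming `ls` can be sketched as follows. This is an illustrative example only: `demo_env`, `x`, and `.hidden` are made-up names, and a separate environment is used so that the listing does not depend on what is already in your session.

```r
# A scratch environment holding one visible and one hidden object
demo_env <- new.env()
assign("x", 0.025, envir = demo_env)
assign(".hidden", 1, envir = demo_env)

# Calling the function: names starting with "." are skipped by default
visible <- ls(demo_env)
print(visible)

# all.names = TRUE includes the hidden names as well
everything <- ls(demo_env, all.names = TRUE)
print(everything)

# Without parentheses, `ls` refers to the function object itself
is.function(ls)
```

Using a throwaway environment keeps the example from touching, or being affected by, the objects in your global environment.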
+ +You can use `rm` to delete objects you no longer need: + +```{r, eval=FALSE} +rm(x) +``` + +If you have lots of things in your environment and want to delete all of them, +you can pass the results of `ls` to the `rm` function: + +```{r, eval=FALSE} +rm(list = ls()) +``` + +In this case we've combined the two. Like the order of operations, anything +inside the innermost parentheses is evaluated first, and so on. + +In this case we've specified that the results of `ls` should be used for the +`list` argument in `rm`. When assigning values to arguments by name, you _must_ +use the `=` operator! + +If instead we use `<-`, there will be unintended side effects, or you may get an error message: + +```{r, error=TRUE} +rm(list <- ls()) +``` + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: Warnings vs. Errors + +Pay attention when R does something unexpected! Errors, like above, +are thrown when R cannot proceed with a calculation. Warnings on the +other hand usually mean that the function has run, but it probably +hasn't worked as expected. + +In both cases, the message that R prints out usually gives you clues +about how to fix the problem. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## R Packages + +It is possible to add functions to R by writing a package, or by +obtaining a package written by someone else. As of this writing, there +are over 10,000 packages available on CRAN (the Comprehensive R Archive +Network). R and RStudio have functionality for managing packages: + +- You can see what packages are installed by typing + `installed.packages()` +- You can install packages by typing `install.packages("packagename")`, + where `packagename` is the package name, in quotes.
+- You can update installed packages by typing `update.packages()` +- You can remove a package with `remove.packages("packagename")` +- You can make a package available for use with `library(packagename)` + +Packages can also be viewed, loaded, and detached in the Packages tab of the lower right panel in RStudio. Clicking on this tab will display all of the installed packages with a checkbox next to them. If the box next to a package name is checked, the package is loaded and if it is empty, the package is not loaded. Click an empty box to load that package and click a checked box to detach that package. + +Packages can be installed and updated from the Package tab with the Install and Update buttons at the top of the tab. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 2 + +What will be the value of each variable after each +statement in the following program? + +```{r, eval=FALSE} +mass <- 47.5 +age <- 122 +mass <- mass * 2.3 +age <- age - 20 +``` + +::::::::::::::: solution + +## Solution to challenge 2 + +```{r ch2pt1-sol} +mass <- 47.5 +``` + +This will give a value of `r mass` for the variable mass + +```{r ch2pt2-sol} +age <- 122 +``` + +This will give a value of `r age` for the variable age + +```{r ch2pt3-sol} +mass <- mass * 2.3 +``` + +This will multiply the existing value of `r mass/2.3` by 2.3 to give a new value of +`r mass` to the variable mass. + +```{r ch2pt4-sol} +age <- age - 20 +``` + +This will subtract 20 from the existing value of `r age + 20 ` to give a new value +of `r age` to the variable age. + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 3 + +Run the code from the previous challenge, and write a command to +compare mass to age. Is mass larger than age? 
+ +::::::::::::::: solution + +## Solution to challenge 3 + +One way of answering this question in R is to use the `>` to set up the following: + +```{r ch3-sol} +mass > age +``` + +This should yield a boolean value of TRUE since `r mass` is greater than `r age`. + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 4 + +Clean up your working environment by deleting the mass and age +variables. + +::::::::::::::: solution + +## Solution to challenge 4 + +We can use the `rm` command to accomplish this task + +```{r ch4-sol} +rm(age, mass) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 5 + +Install the following packages: `ggplot2`, `plyr`, `gapminder` + +::::::::::::::: solution + +## Solution to challenge 5 + +We can use the `install.packages()` command to install the required packages. + +```{r ch5-sol, eval=FALSE} +install.packages("ggplot2") +install.packages("plyr") +install.packages("gapminder") +``` + +An alternate solution, to install multiple packages with a single `install.packages()` command is: + +```{r ch5-sol2, eval=FALSE} +install.packages(c("ggplot2", "plyr", "gapminder")) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::: instructor + +When installing ggplot2, it may be required for some users to use the dependencies flag as a result of lazy loading affecting the install. 
This suggestion is not tied to any known bug discussion; it is based on instructor feedback and experience in resolving sporadic occurrences of errors encountered while delivering this workshop: + +```{r ch5-sol3, eval=FALSE} +install.packages("ggplot2", dependencies = TRUE) +``` + +::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: keypoints + +- Use RStudio to write and run R programs. +- R has the usual arithmetic operators and mathematical functions. +- Use `<-` to assign values to variables. +- Use `ls()` to list the variables in a program. +- Use `rm()` to delete objects in a program. +- Use `install.packages()` to install packages (libraries). + +:::::::::::::::::::::::::::::::::::::::::::::::::: diff --git a/locale/uk/episodes/02-project-intro.Rmd b/locale/uk/episodes/02-project-intro.Rmd new file mode 100644 index 000000000..74f964f40 --- /dev/null +++ b/locale/uk/episodes/02-project-intro.Rmd @@ -0,0 +1,259 @@ +--- +title: Project Management With RStudio +teaching: 20 +exercises: 10 +source: Rmd +--- + +::::::::::::::::::::::::::::::::::::::: objectives + +- Create self-contained projects in RStudio + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- How can I manage my projects in R? + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +```{r, include=FALSE} +``` + +## Introduction + +The scientific process is naturally incremental, and many projects +start life as random notes, some code, then a manuscript, and +eventually everything is a bit mixed together. + + + + +Most people tend to organize their projects like this: + +![](fig/bad_layout.png){alt='Screenshot of file manager demonstrating bad project organisation'} + +There are many reasons why we should _ALWAYS_ avoid this: + +1. It is really hard to tell which version of your data is + the original and which is the modified; +2.
It gets really messy because it mixes files with various + extensions together; +3. It probably takes you a lot of time to actually find + things, and relate the correct figures to the exact code + that has been used to generate it; + +A good project layout will ultimately make your life easier: + +- It will help ensure the integrity of your data; +- It makes it simpler to share your code with someone else + (a lab-mate, collaborator, or supervisor); +- It allows you to easily upload your code with your manuscript submission; +- It makes it easier to pick the project back up after a break. + +## A possible solution + +Fortunately, there are tools and packages which can help you manage your work effectively. + +One of the most powerful and useful aspects of RStudio is its project management +functionality. We'll be using this today to create a self-contained, reproducible +project. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 1: Creating a self-contained project + +We're going to create a new project in RStudio: + +1. Click the "File" menu button, then "New Project". +2. Click "New Directory". +3. Click "New Project". +4. Type in the name of the directory to store your project, e.g. "my\_project". +5. If available, select the checkbox for "Create a git repository." +6. Click the "Create Project" button. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +The simplest way to open an RStudio project once it has been created is to click +through your file system to get to the directory where it was saved and double +click on the `.Rproj` file. This will open RStudio and start your R session in the +same directory as the `.Rproj` file. All your data, plots and scripts will now be +relative to the project directory. RStudio projects have the added benefit of +allowing you to open multiple projects at the same time each open to its own +project directory. This allows you to keep multiple projects open without them +interfering with each other. 
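
As a rough illustration of what "relative to the project directory" means, the sketch below builds a path starting from the working directory. The file name `data/gapminder_data.csv` is only a placeholder here; nothing needs to exist at that location for the code to run.

```r
# Where is the R session currently running?
project_dir <- getwd()

# A relative path: interpreted starting from the working directory
relative_path <- file.path("data", "gapminder_data.csv")
relative_path

# The same location written out as an absolute path
absolute_path <- file.path(project_dir, relative_path)
absolute_path
```

Inside an RStudio project, `project_dir` would normally be the folder that holds the `.Rproj` file, so scripts that use only relative paths keep working when the whole project folder is moved or shared.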
+ +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 2: Opening an RStudio project through the file system + +1. Exit RStudio. +2. Navigate to the directory where you created a project in Challenge 1. +3. Double click on the `.Rproj` file in that directory. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Best practices for project organization + +Although there is no "best" way to lay out a project, there are some general +principles to adhere to that will make project management easier: + +### Treat data as read only + +This is probably the most important goal of setting up a project. Data are +typically time-consuming and/or expensive to collect. Working with them +interactively (e.g., in Excel), where they can be modified, means you are never +sure of where the data came from, or how they have been modified since collection. +It is therefore a good idea to treat your data as "read-only". + +### Data Cleaning + +In many cases your data will be "dirty": it will need significant preprocessing +to get into a format R (or any other programming language) will find useful. +This task is sometimes called "data munging". Storing the cleaning scripts in a +separate folder, and creating a second "read-only" data folder to hold the +"cleaned" data sets, can prevent confusion between the two sets. + +### Treat generated output as disposable + +Anything generated by your scripts should be treated as disposable: you should +be able to regenerate all of it from your scripts. + +There are lots of different ways to manage this output. Having an output folder +with different sub-directories for each separate analysis makes it easier later, +since many analyses are exploratory and don't end up being used in the final +project, and some of the analyses get shared between projects.
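
One possible way to set up such an output folder from R is sketched below. The folder names (`output`, `exploratory`, `final`) are made up for illustration, and the example works inside a temporary directory so that it does not touch a real project.

```r
# Work in a throwaway location rather than a real project
project_dir <- file.path(tempdir(), "my_project")

# One sub-directory per analysis inside a single output folder
output_dirs <- file.path(project_dir, "output", c("exploratory", "final"))

for (d in output_dirs) {
  # recursive = TRUE also creates the parent folders if needed
  dir.create(d, recursive = TRUE, showWarnings = FALSE)
}

# Everything under output/ should be safe to delete and regenerate
list.files(project_dir, recursive = TRUE, include.dirs = TRUE)
```

Keeping generated files in their own folder makes it harder to overwrite raw data by accident, because scripts only ever write below `output/`.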
+ +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: Good Enough Practices for Scientific Computing + +[Good Enough Practices for Scientific Computing](https://github.com/swcarpentry/good-enough-practices-in-scientific-computing/blob/gh-pages/good-enough-practices-for-scientific-computing.pdf) gives the following recommendations for project organization: + +1. Put each project in its own directory, which is named after the project. +2. Put text documents associated with the project in the `doc` directory. +3. Put raw data and metadata in the `data` directory, and files generated during cleanup and analysis in a `results` directory. +4. Put source for the project's scripts and programs in the `src` directory, and programs brought in from elsewhere or compiled locally in the `bin` directory. +5. Name all files to reflect their content or function. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +### Separate function definition and application + +One of the more effective ways to work with R is to start by writing the code you want to run directly in a .R script, and then running the selected lines (either using the keyboard shortcuts in RStudio or clicking the "Run" button) in the interactive R console. + +When your project is in its early stages, the initial .R script file usually contains many lines +of directly executed code. As it matures, reusable chunks get pulled into their +own functions. It's a good idea to separate these functions into two separate folders; one +to store useful functions that you'll reuse across analyses and projects, and +one to store the analysis scripts. + +### Save the data in the data directory + +Now we have a good directory structure we will now place/save the data file in the `data/` directory. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 3 + +Download the gapminder data from [this link to a csv file](data/gapminder_data.csv). + +1. 
Download the file (right mouse click on the link above -> "Save link as" / "Save file as", or click on the link and after the page loads, press Ctrl\+S or choose File -> "Save page as") +2. Make sure it's saved under the name `gapminder_data.csv` +3. Save the file in the `data/` folder within your project. + +We will load and inspect these data later. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 4 + +It is useful to get some general idea about the dataset, directly from the +command line, before loading it into R. Understanding the dataset better +will come in handy when making decisions on how to load it in R. Use the command-line +shell to answer the following questions: + +1. What is the size of the file? +2. How many rows of data does it contain? +3. What kinds of values are stored in this file? + +::::::::::::::: solution + +## Solution to Challenge 4 + +By running these commands in the shell: + +```{r ch2a-sol, engine="sh"} +ls -lh data/gapminder_data.csv +``` + +The file size is 80K. + +```{r ch2b-sol, engine="sh"} +wc -l data/gapminder_data.csv +``` + +There are 1705 lines. The data looks like: + +```{r ch2c-sol, engine="sh"} +head data/gapminder_data.csv +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: command line in RStudio + +The Terminal tab in the console pane provides a convenient place directly +within RStudio to interact directly with the command line. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +### Working directory + +Knowing R's current working directory is important because when you need to access other files (for example, to import a data file), R will look for them relative to the current working directory. + +Each time you create a new RStudio Project, it will create a new directory for that project. 
When you open an existing `.Rproj` file, it will open that project and set R's working directory to the folder that file is in. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 5 + +You can check the current working directory with the `getwd()` command, or by using the menus in RStudio. + +1. In the console, type `getwd()` ("wd" is short for "working directory") and hit Enter. +2. In the Files pane, double click on the `data` folder to open it (or navigate to any other folder you wish). To get the Files pane back to the current working directory, click "More" and then select "Go To Working Directory". + +You can change the working directory with `setwd()`, or by using RStudio menus. + +1. In the console, type `setwd("data")` and hit Enter. Type `getwd()` and hit Enter to see the new working directory. +2. In the menus at the top of the RStudio window, click the "Session" menu button, and then select "Set Working Directory" and then "Choose Directory". Next, in the windows navigator that opens, navigate back to the project directory, and click "Open". Note that a `setwd` command will automatically appear in the console. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: File does not exist errors + +When you're attempting to reference a file in your R code and you're getting errors saying the file doesn't exist, it's a good idea to check your working directory. +You need to either provide an absolute path to the file, or you need to make sure the file is saved in the working directory (or a subfolder of the working directory) and provide a relative path. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +### Version Control + +It is important to use version control with projects. Go [here for a good lesson which describes using Git with RStudio](https://swcarpentry.github.io/git-novice/14-supplemental-rstudio.html). 
+ +:::::::::::::::::::::::::::::::::::::::: keypoints + +- Use RStudio to create and manage projects with consistent layout. +- Treat raw data as read-only. +- Treat generated output as disposable. +- Separate function definition and application. + +:::::::::::::::::::::::::::::::::::::::::::::::::: diff --git a/locale/uk/episodes/03-seeking-help.Rmd b/locale/uk/episodes/03-seeking-help.Rmd new file mode 100644 index 000000000..cc2e3f7b8 --- /dev/null +++ b/locale/uk/episodes/03-seeking-help.Rmd @@ -0,0 +1,267 @@ +--- +title: Seeking Help +teaching: 10 +exercises: 10 +source: Rmd +--- + +::::::::::::::::::::::::::::::::::::::: objectives + +- To be able to read R help files for functions and special operators. +- To be able to use CRAN task views to identify packages to solve a problem. +- To be able to seek help from your peers. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- How can I get help in R? + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +```{r, include=FALSE} +``` + +## Reading Help Files + +R, and every package, provide help files for functions. The general syntax to search for help on any +function, "function\_name", from a specific function that is in a package loaded into your +namespace (your interactive R session) is: + +```{r, eval=FALSE} +?function_name +help(function_name) +``` + +For example take a look at the help file for `write.table()`, we will be using a similar function in an upcoming episode. + +```{r, eval=FALSE} +?write.table() +``` + +This will load up a help page in RStudio (or as plain text in R itself). + +Each help page is broken down into sections: + +- Description: An extended description of what the function does. +- Usage: The arguments of the function and their default values (which can be changed). +- Arguments: An explanation of the data each argument is expecting. +- Details: Any important details to be aware of. 
+- Value: The data the function returns. +- See Also: Any related functions you might find useful. +- Examples: Some examples for how to use the function. + +Different functions might have different sections, but these are the main ones you should be aware of. + +Notice how related functions might call for the same help file: + +```{r, eval=FALSE} +?write.table() +?write.csv() +``` + +This is because these functions have very similar applicability and often share the same arguments as inputs to the function, so package authors often choose to document them together in a single help file. + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: Running Examples + +From within the function help page, you can highlight code in the +Examples and hit Ctrl\+Return to run it in +RStudio console. This gives you a quick way to get a feel for +how a function works. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: Reading Help Files + +One of the most daunting aspects of R is the large number of functions +available. It would be prohibitive, if not impossible to remember the +correct usage for every function you use. Luckily, using the help files +means you don't have to remember that! + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Special Operators + +To seek help on special operators, use quotes or backticks: + +```{r, eval=FALSE} +?"<-" +?`<-` +``` + +## Getting Help with Packages + +Many packages come with "vignettes": tutorials and extended example documentation. +Without any arguments, `vignette()` will list all vignettes for all installed packages; +`vignette(package="package-name")` will list all available vignettes for +`package-name`, and `vignette("vignette-name")` will open the specified vignette. + +If a package doesn't have any vignettes, you can usually find help by typing +`help("package-name")`. 
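
For instance, the sketch below lists the vignettes of a single package. It uses the `grid` package, which normally ships with R, but treat the presence of its vignettes as an assumption about your installation.

```r
# List the vignettes shipped with one package (here: grid, assumed installed)
grid_vignettes <- vignette(package = "grid")

# The listing keeps its entries in a matrix; show the name and title of each
grid_vignettes$results[, c("Item", "Title")]

# Opening a vignette launches a viewer, so only do it in an interactive session
if (interactive()) {
  print(vignette("grid", package = "grid"))
}
```

The names in the `Item` column are the ones to pass to `vignette()` when you want to open a particular document.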
+ +RStudio also has a set of excellent +[cheatsheets](https://rstudio.com/resources/cheatsheets/) for many packages. + +## When You Remember Part of the Function Name + +If you're not sure what package a function is in or how it's specifically spelled, you can do a fuzzy search: + +```{r, eval=FALSE} +??function_name +``` + +A fuzzy search is when you search for an approximate string match. For example, you may remember that the function +to set your working directory includes "set" in its name. You can do a fuzzy search to help you identify the function: + +```{r, eval=FALSE} +??set +``` + +## When You Have No Idea Where to Begin + +If you don't know what function or package you need to use +[CRAN Task Views](https://cran.at.r-project.org/web/views) +is a specially maintained list of packages grouped into +fields. This can be a good starting point. + +## When Your Code Doesn't Work: Seeking Help from Your Peers + +If you're having trouble using a function, 9 times out of 10, +the answers you seek have already been answered on +[Stack Overflow](https://stackoverflow.com/). You can search using +the `[r]` tag. Please make sure to see their page on +[how to ask a good question.](https://stackoverflow.com/help/how-to-ask) + +If you can't find the answer, there are a few useful functions to +help you ask your peers: + +```{r, eval=FALSE} +?dput +``` + +Will dump the data you're working with into a format that can +be copied and pasted by others into their own R session. + +```{r} +sessionInfo() +``` + +Will print out your current version of R, as well as any packages you +have loaded. This can be useful for others to help reproduce and debug +your issue. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 1 + +Look at the help page for the `c` function. 
What kind of vector do you +expect will be created if you evaluate the following: + +```{r, eval=FALSE} +c(1, 2, 3) +c('d', 'e', 'f') +c(1, 2, 'f') +``` + +::::::::::::::: solution + +## Solution to Challenge 1 + +The `c()` function creates a vector, in which all elements are of the +same type. In the first case, the elements are numeric, in the +second, they are characters, and in the third they are also characters: +the numeric values are "coerced" to be characters. + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 2 + +Look at the help for the `paste` function. You will need to use it later. +What's the difference between the `sep` and `collapse` arguments? + +::::::::::::::: solution + +## Solution to Challenge 2 + +To look at the help for the `paste()` function, use: + +```{r, eval=FALSE} +help("paste") +?paste +``` + +The difference between `sep` and `collapse` is a little +tricky. The `paste` function accepts any number of arguments, each of which +can be a vector of any length. The `sep` argument specifies the string +used between concatenated terms — by default, a space. The result is a +vector as long as the longest argument supplied to `paste`. In contrast, +`collapse` specifies that after concatenation the elements are _collapsed_ +together using the given separator, the result being a single string. + +It is important to call the arguments explicitly by typing out the argument +name e.g `sep = ","` so the function understands to use the "," as a +separator and not a term to concatenate. +e.g. + +```{r} +paste(c("a","b"), "c") +paste(c("a","b"), "c", ",") +paste(c("a","b"), "c", sep = ",") +paste(c("a","b"), "c", collapse = "|") +paste(c("a","b"), "c", sep = ",", collapse = "|") +``` + +(For more information, +scroll to the bottom of the `?paste` help page and look at the +examples, or try `example('paste')`.) 
+ +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 3 + +Use help to find a function (and its associated parameters) that you could +use to load data from a tabular file in which columns are delimited with "\\t" +(tab) and the decimal point is a "." (period). This check for decimal +separator is important, especially if you are working with international +colleagues, because different countries have different conventions for the +decimal point (i.e. comma vs period). +Hint: use `??"read table"` to look up functions related to reading in tabular data. + +::::::::::::::: solution + +## Solution to Challenge 3 + +The standard R function for reading tab-delimited files with a period +decimal separator is read.delim(). You can also do this with +`read.table(file, sep="\t")` (the period is the _default_ decimal +separator for `read.table()`), although you may have to change +the `comment.char` argument as well if your data file contains +hash (#) characters. + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Other Resources + +- [Quick R](https://www.statmethods.net/) +- [RStudio cheat sheets](https://www.rstudio.com/resources/cheatsheets/) +- [Cookbook for R](https://www.cookbook-r.com/) + +:::::::::::::::::::::::::::::::::::::::: keypoints + +- Use `help()` to get online help in R. + +:::::::::::::::::::::::::::::::::::::::::::::::::: diff --git a/locale/uk/episodes/04-data-structures-part1.Rmd b/locale/uk/episodes/04-data-structures-part1.Rmd new file mode 100644 index 000000000..b11c2a52c --- /dev/null +++ b/locale/uk/episodes/04-data-structures-part1.Rmd @@ -0,0 +1,1101 @@ +--- +title: Data Structures +teaching: 40 +exercises: 15 +source: Rmd +--- + +::::::::::::::::::::::::::::::::::::::: objectives + +- To be able to identify the 5 main data types. 
+- To begin exploring data frames, and understand how they are related to vectors and lists. +- To be able to ask questions from R about the type, class, and structure of an object. +- To understand the information of the attributes "names", "class", and "dim". + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- How can I read data in R? +- What are the basic data types in R? +- How do I represent categorical information in R? + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +```{r, include=FALSE} +options(stringsAsFactors = FALSE) +cats_orig <- data.frame(coat = c("calico", "black", "tabby"), weight = c(2.1, 5, 3.2), likes_catnip = c(1, 0, 1), stringsAsFactors = FALSE) +cats_bad <- data.frame(coat = c("calico", "black", "tabby", "tabby"), weight = c(2.1, 5, 3.2, "2.3 or 2.4"), likes_catnip = c(1, 0, 1, 1), stringsAsFactors = FALSE) +cats <- cats_orig +``` + +One of R's most powerful features is its ability to deal with tabular data - +such as you may already have in a spreadsheet or a CSV file. Let's start by +making a toy dataset in your `data/` directory, called `feline-data.csv`: + +```{r} +cats <- data.frame(coat = c("calico", "black", "tabby"), + weight = c(2.1, 5.0, 3.2), + likes_catnip = c(1, 0, 1)) +``` + +We can now save `cats` as a CSV file. It is good practice to call the argument +names explicitly so the function knows what default values you are changing. Here we +are setting `row.names = FALSE`. Recall you can use `?write.csv` to pull +up the help file to check out the argument names and their default values. 
+ +```{r} +write.csv(x = cats, file = "data/feline-data.csv", row.names = FALSE) +``` + +The contents of the new file, `feline-data.csv`: + +```{r, eval=FALSE} +coat,weight,likes_catnip +calico,2.1,1 +black,5.0,0 +tabby,3.2,1 +``` + +::::::::::::::::::::::::::::::::::::::::: callout + +### Tip: Editing Text files in R + +Alternatively, you can create `data/feline-data.csv` using a text editor (Nano), +or within RStudio with the **File -> New File -> Text File** menu item. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +We can load this into R via the following: + +```{r} +cats <- read.csv(file = "data/feline-data.csv") +cats +``` + +The `read.table` function is used for reading in tabular data stored in a text +file where the columns of data are separated by punctuation characters such as +CSV files (csv = comma-separated values). Tabs and commas are the most common +punctuation characters used to separate or delimit data points in csv files. +For convenience R provides 2 other versions of `read.table`. These are: `read.csv` +for files where the data are separated with commas and `read.delim` for files +where the data are separated with tabs. Of these three functions `read.csv` is +the most commonly used. If needed it is possible to override the default +delimiting punctuation marks for both `read.csv` and `read.delim`. + +::::::::::::::::::::::::::::::::::::::::: callout + +### Check your data for factors + +In recent times, the default way how R handles textual data has changed. Text +data was interpreted by R automatically into a format called "factors". But +there is an easier format that is called "character". We will hear about +factors later, and what to use them for. For now, remember that in most cases, +they are not needed and only complicate your life, which is why newer R +versions read in text as "character". Check now if your version of R has +automatically created factors and convert them to "character" format: + +1. 
Check the data types of your input by typing `str(cats)` +2. In the output, look at the type codes after the colons: If you see + only codes such as "num", "int" and "chr", you can continue with the lesson and skip this box. + If you find "Factor", continue to step 3. +3. Prevent R from automatically creating "factor" data. That can be done by + the following code: `options(stringsAsFactors = FALSE)`. Then, re-read + the cats table for the change to take effect. +4. You must set this option every time you restart R. So that you do not forget, + include it in your analysis script before you read in any data, for example + in one of the first lines. +5. From R version 4.0.0 onwards, text data is no longer converted to + factors by default. So you can install this or a newer version to avoid this + problem. If you are working on an institute or company computer, ask your + administrator to do it. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +We can begin exploring our dataset right away, pulling out columns by specifying +them using the `$` operator: + +```{r} +cats$weight +cats$coat +``` + +We can do other operations on the columns: + +```{r} +## Say we discovered that the scale weighs two kg light: +cats$weight + 2 +paste("My cat is", cats$coat) +``` + +But what about + +```{r} +cats$weight + cats$coat +``` + +Understanding what happened here is key to successfully analyzing data in R. + +### Data Types + +If you guessed that the last command will return an error because `2.1` plus +`"black"` is nonsense, you're right - and you already have some intuition for an +important concept in programming called _data types_. We can ask what type of +data something is: + +```{r} +typeof(cats$weight) +``` + +There are 5 main types: `double`, `integer`, `complex`, `logical` and `character`. +For historic reasons, `double` is also called `numeric`.
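
As a small illustration of that naming quirk, the sketch below asks R about the same value in three different ways. The variable name `x` is arbitrary.

```r
x <- 3.14

typeof(x)       # the underlying storage type
class(x)        # the older, more general name
is.numeric(x)   # TRUE for doubles...
is.numeric(1L)  # ...and for integers as well
```

In practice, `typeof()` is the function to reach for when you want to know which of the five basic types you are dealing with.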
+ +```{r} +typeof(3.14) +typeof(1L) # The L suffix forces the number to be an integer, since by default R uses float numbers +typeof(1+1i) +typeof(TRUE) +typeof('banana') +``` + +No matter how +complicated our analyses become, all data in R is interpreted as one of these +basic data types. This strictness has some really important consequences. + +A user has added details of another cat. This information is in the file +`data/feline-data_v2.csv`. + +```{r, eval=FALSE} +file.show("data/feline-data_v2.csv") +``` + +```{r, eval=FALSE} +coat,weight,likes_catnip +calico,2.1,1 +black,5.0,0 +tabby,3.2,1 +tabby,2.3 or 2.4,1 +``` + +Load the new cats data like before, and check what type of data we find in the +`weight` column: + +```{r} +cats <- read.csv(file="data/feline-data_v2.csv") +typeof(cats$weight) +``` + +Oh no, our weights aren't the double type anymore! If we try to do the same math +we did on them before, we run into trouble: + +```{r} +cats$weight + 2 +``` + +What happened? +The `cats` data we are working with is something called a _data frame_. Data frames +are one of the most common and versatile types of _data structures_ we will work with in R. +A given column in a data frame cannot be composed of different data types. +In this case, R does not read everything in the data frame column `weight` as a _double_, therefore the entire +column data type changes to something that is suitable for everything in the column. + +When R reads a csv file, it reads it in as a _data frame_. Thus, when we loaded the `cats` +csv file, it is stored as a data frame. We can recognize data frames by the first row that +is written by the `str()` function: + +```{r} +str(cats) +``` + +_Data frames_ are composed of rows and columns, where each column has the +same number of rows. 
Different columns in a data frame can be made up of different +data types (this is what makes them so versatile), but everything in a given +column needs to be the same type (e.g., vector, factor, or list). + +Let's explore more about different data structures and how they behave. +For now, let's remove that extra line from our cats data and reload it, +while we investigate this behavior further: + +feline-data.csv: + +``` +coat,weight,likes_catnip +calico,2.1,1 +black,5.0,0 +tabby,3.2,1 +``` + +And back in RStudio: + +```{r, eval=FALSE} +cats <- read.csv(file="data/feline-data.csv") +``` + +```{r, include=FALSE} +cats <- cats_orig +``` + +### Vectors and Type Coercion + +To better understand this behavior, let's meet another of the data structures: +the _vector_. + +```{r} +my_vector <- vector(length = 3) +my_vector +``` + +A vector in R is essentially an ordered list of things, with the special +condition that _everything in the vector must be the same basic data type_. If +you don't choose the datatype, it'll default to `logical`; or, you can declare +an empty vector of whatever type you like. + +```{r} +another_vector <- vector(mode='character', length=3) +another_vector +``` + +You can check if something is a vector: + +```{r} +str(another_vector) +``` + +The somewhat cryptic output from this command indicates the basic data type +found in this vector - in this case `chr`, character; an indication of the +number of things in the vector - actually, the indexes of the vector, in this +case `[1:3]`; and a few examples of what's actually in the vector - in this case +empty character strings. If we similarly do + +```{r} +str(cats$weight) +``` + +we see that `cats$weight` is a vector, too - _the columns of data we load into R +data.frames are all vectors_, and that's the root of why R forces everything in +a column to be the same basic data type. 
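
A quick way to see this for yourself is to ask R about each column. The sketch below rebuilds a small `cats`-like data frame under a different, made-up name (`cats_demo`) so that it runs on its own.

```r
cats_demo <- data.frame(coat = c("calico", "black", "tabby"),
                        weight = c(2.1, 5.0, 3.2),
                        likes_catnip = c(1, 0, 1),
                        stringsAsFactors = FALSE)

# Each column, pulled out with $, is an ordinary vector
is.vector(cats_demo$weight)

# Every column has exactly one basic data type
column_types <- sapply(cats_demo, typeof)
column_types
```

Because `sapply()` visits the columns one at a time, the result has one type name per column, which is only possible because a column can never mix types.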
+ +:::::::::::::::::::::::::::::::::::::: discussion + +### Discussion 1 + +Why is R so opinionated about what we put in our columns of data? +How does this help us? + +::::::::::::::: solution + +### Discussion 1 + +By keeping everything in a column the same, we allow ourselves to make simple +assumptions about our data; if you can interpret one entry in the column as a +number, then you can interpret _all_ of them as numbers, so we don't have to +check every time. This consistency is what people mean when they talk about +_clean data_; in the long run, strict consistency goes a long way to making +our lives easier in R. + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +#### Coercion by combining vectors + +You can also make vectors with explicit contents with the combine function: + +```{r} +combine_vector <- c(2,6,3) +combine_vector +``` + +Given what we've learned so far, what do you think the following will produce? + +```{r} +quiz_vector <- c(2,6,'3') +``` + +This is something called _type coercion_, and it is the source of many surprises +and the reason why we need to be aware of the basic data types and how R will +interpret them. When R encounters a mix of types (here double and character) to +be combined into a single vector, it will force them all to be the same +type. Consider: + +```{r} +coercion_vector <- c('a', TRUE) +coercion_vector +another_coercion_vector <- c(0, TRUE) +another_coercion_vector +``` + +#### The type hierarchy + +The coercion rules go: `logical` -> `integer` -> `double` ("`numeric`") -> +`complex` -> `character`, where -> can be read as _are transformed into_. For +example, combining `logical` and `character` transforms the result to +`character`: + +```{r} +c('a', TRUE) +``` + +A quick way to recognize `character` vectors is by the quotes that enclose them +when they are printed. 
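
Going back to the hierarchy above, each step can be checked one at a time with `typeof()`. This is just a small sketch; the values themselves are arbitrary.

```r
step1 <- typeof(c(TRUE, 1L))    # logical and integer
step2 <- typeof(c(1L, 2.5))     # integer and double
step3 <- typeof(c(2.5, 1+1i))   # double and complex
step4 <- typeof(c(1+1i, "a"))   # complex and character

# Each combination ends up as the type further along the hierarchy
c(step1, step2, step3, step4)
```

In every case the result is the type that can represent both inputs without losing information, which is why `character` sits at the end of the chain.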
+ +You can try to force +coercion against this flow using the `as.` functions: + +```{r} +character_vector_example <- c('0','2','4') +character_vector_example +character_coerced_to_double <- as.double(character_vector_example) +character_coerced_to_double +double_coerced_to_logical <- as.logical(character_coerced_to_double) +double_coerced_to_logical +``` + +As you can see, some surprising things can happen when R forces one basic data +type into another! Nitty-gritty of type coercion aside, the point is: if your +data doesn't look like what you thought it was going to look like, type coercion +may well be to blame; make sure everything is the same type in your vectors and +your columns of data.frames, or you will get nasty surprises! + +But coercion can also be very useful! For example, in our `cats` data +`likes_catnip` is numeric, but we know that the 1s and 0s actually represent +`TRUE` and `FALSE` (a common way of representing them). We should use the +`logical` datatype here, which has two states: `TRUE` or `FALSE`, which is +exactly what our data represents. We can 'coerce' this column to be `logical` by +using the `as.logical` function: + +```{r} +cats$likes_catnip +cats$likes_catnip <- as.logical(cats$likes_catnip) +cats$likes_catnip +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +### Challenge 1 + +An important part of every data analysis is cleaning the input data. If you +know that the input data is all of the same format, (e.g. numbers), your +analysis is much easier! Clean the cat data set from the chapter about +type coercion. + +#### Copy the code template + +Create a new script in RStudio and copy and paste the following code. Then +move on to the tasks below, which help you to fill in the gaps (\_\_\_\_\_\_). + +``` +# Read data +cats <- read.csv("data/feline-data_v2.csv") + +# 1. Print the data +_____ + +# 2. Show an overview of the table with all data types +_____(cats) + +# 3. 
The "weight" column has the incorrect data type __________. +# The correct data type is: ____________. + +# 4. Correct the 4th weight data point with the mean of the two given values +cats$weight[4] <- 2.35 +# print the data again to see the effect +cats + +# 5. Convert the weight to the right data type +cats$weight <- ______________(cats$weight) + +# Calculate the mean to test yourself +mean(cats$weight) + +# If you see the correct mean value (and not NA), you did the exercise +# correctly! +``` + +### Instructions for the tasks + +#### 1\. Print the data + +Execute the first statement (`read.csv(...)`). Then print the data to the +console + +::::::::::::::: solution + +### Tip 1.1 + +Show the content of any variable by typing its name. + +### Solution to Challenge 1.1 + +Two correct solutions: + +``` +cats +print(cats) +``` + +::::::::::::::::::::::::: + +#### 2\. Overview of the data types + +The data type of your data is as important as the data itself. Use a +function we saw earlier to print out the data types of all columns of the +`cats` table. + +::::::::::::::: solution + +### Tip 1.2 + +In the chapter "Data types" we saw two functions that can show data types. +One printed just a single word, the data type name. The other printed +a short form of the data type, and the first few values. We need the second +here. + +::::::::::::::::::::::::: + +> ### Solution to Challenge 1.2 +> +> ``` +> str(cats) +> ``` + +#### 3\. Which data type do we need? + +The shown data type is not the right one for this data (weight of +a cat). Which data type do we need? + +- Why did the `read.csv()` function not choose the correct data type? +- Fill in the gap in the comment with the correct data type for cat weight! 
+ +::::::::::::::: solution + +### Tip 1.3 + +Scroll up to the section about the [type hierarchy](#the-type-hierarchy) +to review the available data types. + +::::::::::::::::::::::::: + +::::::::::::::: solution + +### Solution to Challenge 1.3 + +- Weight is expressed on a continuous scale (real numbers). The R + data type for this is "double" (also known as "numeric"). +- The fourth row has the value "2.3 or 2.4". That is not one number + but two, joined by an English word. Therefore, the "character" data type + is chosen. The whole column is now text, because all values in the same + column have to be the same data type. + +::::::::::::::::::::::::: + +#### 4\. Correct the problematic value + +The code to assign a new weight value to the problematic fourth row is given. +Think first and then execute it: What will be the data type after assigning +a number like in this example? +You can check the data type after executing to see if you were right. + +::::::::::::::: solution + +### Tip 1.4 + +Revisit the hierarchy of data types when two different data types are +combined. + +::::::::::::::::::::::::: + +> ### Solution to Challenge 1.4 +> +> The data type of the column "weight" is "character". The assigned data +> type is "double". Combining two data types yields the data type that is +> higher in the following hierarchy: +> +> ``` +> logical < integer < double < complex < character +> ``` +> +> Therefore, the column is still of type character! We need to manually +> convert it to "double". + +#### 5\. Convert the column "weight" to the correct data type + +Cat weights are numbers, but the column does not have this data type yet. +Coerce the column to floating point numbers. + +::::::::::::::: solution + +### Tip 1.5 + +The functions to convert data types start with `as.`. You can look +for the function further up in the manuscript or use the RStudio +auto-complete function: Type "`as.`" and then press the TAB key.
+ +::::::::::::::::::::::::: + +> ### Solution to Challenge 1.5 +> +> There are two functions that are synonymous for historic reasons: +> +> ``` +> cats$weight <- as.double(cats$weight) +> cats$weight <- as.numeric(cats$weight) +> ``` + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +### Some basic vector functions + +The combine function, `c()`, will also append things to an existing vector: + +```{r} +ab_vector <- c('a', 'b') +ab_vector +combine_example <- c(ab_vector, 'SWC') +combine_example +``` + +You can also make series of numbers: + +```{r} +mySeries <- 1:10 +mySeries +seq(10) +seq(1,10, by=0.1) +``` + +We can ask a few questions about vectors: + +```{r} +sequence_example <- 20:25 +head(sequence_example, n=2) +tail(sequence_example, n=4) +length(sequence_example) +typeof(sequence_example) +``` + +We can get individual elements of a vector by using the bracket notation: + +```{r} +first_element <- sequence_example[1] +first_element +``` + +To change a single element, use the bracket on the other side of the arrow: + +```{r} +sequence_example[1] <- 30 +sequence_example +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +### Challenge 2 + +Start by making a vector with the numbers 1 through 26. +Then, multiply the vector by 2. + +::::::::::::::: solution + +### Solution to Challenge 2 + +```{r} +x <- 1:26 +x <- x * 2 +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +### Lists + +Another data structure you'll want in your bag of tricks is the `list`. A list +is simpler in some ways than the other types, because you can put anything you +want in it. Remember _everything in the vector must be of the same basic data type_, +but a list can have different data types: + +```{r} +list_example <- list(1, "a", TRUE, 1+4i) +list_example +``` + +When printing the object structure with `str()`, we see the data types of all +elements: + +```{r} +str(list_example) +``` + +What is the use of lists? 
They can **organize data of different types**. For +example, you can organize different tables that belong together, similar to +spreadsheets in Excel. But there are many other uses, too. + +We will see another example that will maybe surprise you in the next chapter. + +To retrieve one of the elements of a list, use the **double bracket**: + +```{r} +list_example[[2]] +``` + +The elements of lists also can have **names**, they can be given by prepending +them to the values, separated by an equals sign: + +```{r} +another_list <- list(title = "Numbers", numbers = 1:10, data = TRUE ) +another_list +``` + +This results in a **named list**. Now we have a new function of our object! +We can access single elements by an additional way! + +```{r} +another_list$title +``` + +## Names + +With names, we can give meaning to elements. It is the first time that we do not +only have the **data**, but also explaining information. It is _metadata_ +that can be stuck to the object like a label. In R, this is called an +**attribute**. Some attributes enable us to do more with our +object, for example, like here, accessing an element by a self-defined name. + +### Accessing vectors and lists by name + +We have already seen how to generate a named list. The way to generate a named +vector is very similar. You have seen this function before: + +```{r} +pizza_price <- c( pizzasubito = 5.64, pizzafresh = 6.60, callapizza = 4.50 ) +``` + +The way to retrieve elements is different, though: + +```{r} +pizza_price["pizzasubito"] +``` + +The approach used for the list does not work: + +```{r} +pizza_price$pizzafresh +``` + +It will pay off if you remember this error message, you will meet it in your own +analyses. It means that you have just tried accessing an element like it was in +a list, but it is actually in a vector. 
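The difference between the two access styles can be seen side by side. This is a minimal sketch; `pizza_demo` and `pizza_list` are illustrative names, and the values repeat those used above:

```r
# A named vector, with the same values as pizza_price above
pizza_demo <- c(pizzasubito = 5.64, pizzafresh = 6.60, callapizza = 4.50)

# Single brackets keep the name attached to the value
pizza_demo["pizzafresh"]

# Double brackets return the bare value, without its name
pizza_demo[["pizzafresh"]]

# The $ operator works only once the vector has been converted to a list
pizza_list <- as.list(pizza_demo)
pizza_list$pizzafresh
```

Converting with `as.list()` keeps the names, which is why `$` can find the element afterwards.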
+ +### Accessing and changing names + +If you are only interested in the names, use the `names()` function: + +```{r} +names(pizza_price) +``` + +We have seen how to access and change single elements of a vector. The same is +possible for names: + +```{r} +names(pizza_price)[3] +names(pizza_price)[3] <- "call-a-pizza" +pizza_price +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +### Challenge 3 + +- What is the data type of the names of `pizza_price`? You can find out + using the `str()` or `typeof()` functions. + +::::::::::::::: solution + +### Solution to Challenge 3 + +You get the names of an object by wrapping the object name inside +`names(...)`. Similarly, you get the data type of the names by again +wrapping the whole code in `typeof(...)`: + +``` +typeof(names(pizza_price)) +``` + +Alternatively, use a new variable if this is easier for you to read: + +``` +n <- names(pizza_price) +typeof(n) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +### Challenge 4 + +Instead of just changing some of the names a vector/list already has, you can +also set all names of an object by writing code like (replace ALL CAPS text): + +``` +names( OBJECT ) <- CHARACTER_VECTOR +``` + +Create a vector that gives the number for each letter in the alphabet! + +1. Generate a vector called `letter_no` with the sequence of numbers from 1 + to 26! +2. R has a built-in object called `LETTERS`. It is a 26-character vector, from + A to Z. Set the names of the number sequence to these 26 letters. +3. Test yourself by calling `letter_no["B"]`, which should give you the number + 2!
+ +::::::::::::::: solution + +### Solution to Challenge 4 + +``` +letter_no <- 1:26 # or seq(1,26) +names(letter_no) <- LETTERS +letter_no["B"] +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Data frames + +We saw data frames at the very beginning of this lesson; they represent +a table of data. We didn't go much further into detail with our example cat +data frame: + +```{r} +cats +``` + +We can now understand something a bit surprising in our data.frame; what happens +if we run: + +```{r} +typeof(cats) +``` + +We see that data.frames look like lists 'under the hood'. Think again about what we +learned lists can be used for: + +> Lists organize data of different types + +Columns of a data frame are vectors of different types that are organized +by belonging to the same table. + +A data.frame is really a list of vectors. It is a special list in which all the +vectors must have the same length. + +How is this "special"-ness written into the object, so that R does not treat it +like any other list, but as a table? + +```{r} +class(cats) +``` + +A **class**, just like names, is an attribute attached to the object. It tells +us what this object means for humans. + +You might wonder: Why do we need another what-type-of-object-is-this-function? +We already have `typeof()`. That function tells us how the object is +**constructed in the computer**. The `class` is the **meaning of the object for +humans**. Consequently, what `typeof()` returns is _fixed_ in R (mainly the +five data types), whereas the output of `class()` is _diverse_ and _extendable_ +by R packages. + +In our `cats` example, we have an integer, a double and a logical variable. As +we have seen already, each column of a data.frame is a vector. + +```{r} +cats$coat +cats[,1] +typeof(cats[,1]) +str(cats[,1]) +``` + +Each row is an _observation_ of different variables, itself a data.frame, and +thus can be composed of elements of different types.
+ +```{r} +cats[1,] +typeof(cats[1,]) +str(cats[1,]) +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +### Challenge 5 + +There are several subtly different ways to call variables, observations and +elements from data.frames: + +- `cats[1]` +- `cats[[1]]` +- `cats$coat` +- `cats["coat"]` +- `cats[1, 1]` +- `cats[, 1]` +- `cats[1, ]` + +Try out these examples and explain what is returned by each one. + +_Hint:_ Use the function `typeof()` to examine what is returned in each case. + +::::::::::::::: solution + +### Solution to Challenge 5 + +```{r, eval=TRUE, echo=TRUE} +cats[1] +``` + +We can think of a data frame as a list of vectors. The single brace `[1]` +returns the first slice of the list, as another list. In this case it is the +first column of the data frame. + +```{r, eval=TRUE, echo=TRUE} +cats[[1]] +``` + +The double brace `[[1]]` returns the contents of the list item. In this case +it is the contents of the first column, a _vector_ of type _character_. + +```{r, eval=TRUE, echo=TRUE} +cats$coat +``` + +This example uses the `$` character to address items by name. _coat_ is the +first column of the data frame, again a _vector_ of type _character_. + +```{r, eval=TRUE, echo=TRUE} +cats["coat"] +``` + +Here we are using a single brace `["coat"]` replacing the index number with +the column name. Like example 1, the returned object is a _list_. + +```{r, eval=TRUE, echo=TRUE} +cats[1, 1] +``` + +This example uses a single brace, but this time we provide row and column +coordinates. The returned object is the value in row 1, column 1. The object +is a _vector_ of type _character_. + +```{r, eval=TRUE, echo=TRUE} +cats[, 1] +``` + +Like the previous example we use single braces and provide row and column +coordinates. The row coordinate is not specified, R interprets this missing +value as all the elements in this _column_ and returns them as a _vector_. 
+ +```{r, eval=TRUE, echo=TRUE} +cats[1, ] +``` + +Again we use the single brace with row and column coordinates. The column +coordinate is not specified. The return value is a _list_ containing all the +values in the first row. + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::::: callout + +### Tip: Renaming data frame columns + +Data frames have column names, which can be accessed with the `names()` function. + +```{r} +names(cats) +``` + +If you want to rename the second column of `cats`, you can assign a new name to the second element of `names(cats)`. + +```{r} +names(cats)[2] <- "weight_kg" +cats +``` + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +```{r, include=FALSE} +# reverting cats back to original version +cats <- cats_orig +``` + +### Matrices + +Last but not least is the matrix. We can declare a matrix full of zeros: + +```{r} +matrix_example <- matrix(0, ncol=6, nrow=3) +matrix_example +``` + +What makes it special is the `dim()` attribute: + +```{r} +dim(matrix_example) +``` + +And similar to other data structures, we can ask things about our matrix: + +```{r} +typeof(matrix_example) +class(matrix_example) +str(matrix_example) +nrow(matrix_example) +ncol(matrix_example) +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +### Challenge 6 + +What do you think will be the result of +`length(matrix_example)`? +Try it. +Were you right? Why / why not? + +::::::::::::::: solution + +### Solution to Challenge 6 + +What do you think will be the result of +`length(matrix_example)`? + +```{r} +matrix_example <- matrix(0, ncol=6, nrow=3) +length(matrix_example) +``` + +Because a matrix is a vector with added dimension attributes, `length` +gives you the total number of elements in the matrix. 
+ +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +### Challenge 7 + +Make another matrix, this time containing the numbers 1:50, +with 5 columns and 10 rows. +Did the `matrix` function fill your matrix by column, or by +row, as its default behaviour? +See if you can figure out how to change this. +(hint: read the documentation for `matrix`!) + +::::::::::::::: solution + +### Solution to Challenge 7 + +Make another matrix, this time containing the numbers 1:50, +with 5 columns and 10 rows. +Did the `matrix` function fill your matrix by column, or by +row, as its default behaviour? +See if you can figure out how to change this. +(hint: read the documentation for `matrix`!) + +```{r, eval=FALSE} +x <- matrix(1:50, ncol=5, nrow=10) +x <- matrix(1:50, ncol=5, nrow=10, byrow = TRUE) # to fill by row +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +### Challenge 8 + +Create a list of length two containing a character vector for each of the sections in this part of the workshop: + +- Data types +- Data structures + +Populate each character vector with the names of the data types and data +structures we've seen so far. + +::::::::::::::: solution + +### Solution to Challenge 8 + +```{r} +dataTypes <- c('double', 'complex', 'integer', 'character', 'logical') +dataStructures <- c('data.frame', 'vector', 'list', 'matrix') +answer <- list(dataTypes, dataStructures) +``` + +Note: it's nice to make a list in big writing on the board or taped to the wall +listing all of these types and structures - leave it up for the rest of the workshop +to remind people of the importance of these basics. 
+ +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +### Challenge 9 + +Consider the R output of the matrix below: + +```{r, echo=FALSE} +matrix(c(4, 1, 9, 5, 10, 7), ncol = 2, byrow = TRUE) +``` + +What was the correct command used to write this matrix? Examine +each command and try to figure out the correct one before typing them. +Think about what matrices the other commands will produce. + +1. `matrix(c(4, 1, 9, 5, 10, 7), nrow = 3)` +2. `matrix(c(4, 9, 10, 1, 5, 7), ncol = 2, byrow = TRUE)` +3. `matrix(c(4, 9, 10, 1, 5, 7), nrow = 2)` +4. `matrix(c(4, 1, 9, 5, 10, 7), ncol = 2, byrow = TRUE)` + +::::::::::::::: solution + +### Solution to Challenge 9 + +Consider the R output of the matrix below: + +```{r, echo=FALSE} +matrix(c(4, 1, 9, 5, 10, 7), ncol = 2, byrow = TRUE) +``` + +What was the correct command used to write this matrix? Examine +each command and try to figure out the correct one before typing them. +Think about what matrices the other commands will produce. + +```{r, eval=FALSE} +matrix(c(4, 1, 9, 5, 10, 7), ncol = 2, byrow = TRUE) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: keypoints + +- Use `read.csv` to read tabular data in R. +- The basic data types in R are double, integer, complex, logical, and character. +- Data structures such as data frames or matrices are built on top of lists and vectors, with some added attributes. 
+ +:::::::::::::::::::::::::::::::::::::::::::::::::: diff --git a/locale/uk/episodes/05-data-structures-part2.Rmd b/locale/uk/episodes/05-data-structures-part2.Rmd new file mode 100644 index 000000000..abc4d714a --- /dev/null +++ b/locale/uk/episodes/05-data-structures-part2.Rmd @@ -0,0 +1,395 @@ +--- +title: Exploring Data Frames +teaching: 20 +exercises: 10 +source: Rmd +--- + +::::::::::::::::::::::::::::::::::::::: objectives + +- Add and remove rows or columns. +- Append two data frames. +- Display basic properties of data frames including size and class of the columns, names, and first few rows. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- How can I manipulate a data frame? + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +```{r, include=FALSE} +``` + +At this point, you've seen it all: in the last lesson, we toured all the basic +data types and data structures in R. Everything you do will be a manipulation of +those tools. But most of the time, the star of the show is the data frame—the table that we created by loading information from a csv file. In this lesson, we'll learn a few more things +about working with data frames. + +## Adding columns and rows in data frames + +We already learned that the columns of a data frame are vectors, so that our +data are consistent in type throughout the columns. As such, if we want to add a +new column, we can start by making a new vector: + +```{r, echo=FALSE} +cats <- read.csv("data/feline-data.csv") +``` + +```{r} +age <- c(2, 3, 5) +cats +``` + +We can then add this as a column via: + +```{r} +cbind(cats, age) +``` + +Note that if we tried to add a vector of ages with a different number of entries than the number of rows in the data frame, it would fail: + +```{r, error=TRUE} +age <- c(2, 3, 5, 12) +cbind(cats, age) + +age <- c(2, 3) +cbind(cats, age) +``` + +Why didn't this work? 
Of course, R wants to see one element in our new column +for every row in the table: + +```{r} +nrow(cats) +length(age) +``` + +So for it to work we need to have `nrow(cats)` = `length(age)`. Let's overwrite the content of cats with our new data frame. + +```{r} +age <- c(2, 3, 5) +cats <- cbind(cats, age) +``` + +Now how about adding rows? We already know that the rows of a +data frame are lists: + +```{r} +newRow <- list("tortoiseshell", 3.3, TRUE, 9) +cats <- rbind(cats, newRow) +``` + +Let's confirm that our new row was added correctly. + +```{r} +cats +``` + +## Removing rows + +We now know how to add rows and columns to our data frame in R. Now let's learn to remove rows. + +```{r} +cats +``` + +We can ask for a data frame minus the last row: + +```{r} +cats[-4, ] +``` + +Notice the comma with nothing after it to indicate that we want to drop the entire fourth row. + +Note: we could also remove several rows at once by putting the row numbers +inside of a vector, for example: `cats[c(-3,-4), ]` + +## Removing columns + +We can also remove columns in our data frame. What if we want to remove the column "age"? We can remove it in two ways: by column number or by column name. + +```{r} +cats[,-4] +``` + +Notice the comma with nothing before it, indicating we want to keep all of the rows. + +Alternatively, we can drop the column by using the column name and the `%in%` operator. The `%in%` operator goes through each element of its left argument, in this case the names of `cats`, and asks, "Does this element occur in the second argument?" + +```{r} +drop <- names(cats) %in% c("age") +cats[,!drop] +``` + +We will cover subsetting with logical operators like `%in%` in more detail in the next episode.
See the section [Subsetting through other logical operations](06-data-subsetting.Rmd) + +## Appending to a data frame + +The key to remember when adding data to a data frame is that _columns are +vectors and rows are lists._ We can also glue two data frames +together with `rbind`: + +```{r} +cats <- rbind(cats, cats) +cats +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 1 + +You can create a new data frame right from within R with the following syntax: + +```{r} +df <- data.frame(id = c("a", "b", "c"), + x = 1:3, + y = c(TRUE, TRUE, FALSE)) +``` + +Make a data frame that holds the following information for yourself: + +- first name +- last name +- lucky number + +Then use `rbind` to add an entry for the people sitting beside you. +Finally, use `cbind` to add a column with each person's answer to the question, "Is it time for coffee break?" + +::::::::::::::: solution + +## Solution to Challenge 1 + +```{r} +df <- data.frame(first = c("Grace"), + last = c("Hopper"), + lucky_number = c(0)) +df <- rbind(df, list("Marie", "Curie", 238) ) +df <- cbind(df, coffeetime = c(TRUE,TRUE)) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Realistic example + +So far, you have seen the basics of manipulating data frames with our cat data; +now let's use those skills to digest a more realistic dataset. Let's read in the +`gapminder` dataset that we downloaded previously: + +```{r} +gapminder <- read.csv("data/gapminder_data.csv") +``` + +::::::::::::::::::::::::::::::::::::::::: callout + +## Miscellaneous Tips + +- Another type of file you might encounter are tab-separated value files (.tsv). To specify a tab as a separator, use `"\\t"` or `read.delim()`. + +- Files can also be downloaded directly from the Internet into a local + folder of your choice onto your computer using the `download.file` function. 
+ The `read.csv` function can then be executed to read the downloaded file from the download location, for example, + +```{r, eval=FALSE, echo=TRUE} +download.file("https://raw.githubusercontent.com/swcarpentry/r-novice-gapminder/main/episodes/data/gapminder_data.csv", destfile = "data/gapminder_data.csv") +gapminder <- read.csv("data/gapminder_data.csv") +``` + +- Alternatively, you can also read in files directly into R from the Internet by replacing the file paths with a web address in `read.csv`. One should note that in doing this no local copy of the csv file is first saved onto your computer. For example, + +```{r, eval=FALSE, echo=TRUE} +gapminder <- read.csv("https://raw.githubusercontent.com/swcarpentry/r-novice-gapminder/main/episodes/data/gapminder_data.csv") +``` + +- You can read directly from excel spreadsheets without + converting them to plain text first by using the [readxl](https://cran.r-project.org/package=readxl) package. + +- The argument "stringsAsFactors" can be useful to tell R how to read strings either as factors or as character strings. In R versions after 4.0, all strings are read-in as characters by default, but in earlier versions of R, strings are read-in as factors by default. For more information, see the call-out in [the previous episode](https://swcarpentry.github.io/r-novice-gapminder/04-data-structures-part1.html#check-your-data-for-factors). + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +Let's investigate gapminder a bit; the first thing we should always do is check +out what the data looks like with `str`: + +```{r} +str(gapminder) +``` + +An additional method for examining the structure of gapminder is to use the `summary` function. This function can be used on various objects in R. For data frames, `summary` yields a numeric, tabular, or descriptive summary of each column. 
Numeric or integer columns are described by the descriptive statistics (quartiles and mean), and character columns by its length, class, and mode. + +```{r} +summary(gapminder) +``` + +Along with the `str` and `summary` functions, we can examine individual columns of the data frame with our `typeof` function: + +```{r} +typeof(gapminder$year) +typeof(gapminder$country) +str(gapminder$country) +``` + +We can also interrogate the data frame for information about its dimensions; +remembering that `str(gapminder)` said there were 1704 observations of 6 +variables in gapminder, what do you think the following will produce, and why? + +```{r} +length(gapminder) +``` + +A fair guess would have been to say that the length of a data frame would be the +number of rows it has (1704), but this is not the case; remember, a data frame +is a _list of vectors and factors_: + +```{r} +typeof(gapminder) +``` + +When `length` gave us 6, it's because gapminder is built out of a list of 6 +columns. To get the number of rows and columns in our dataset, try: + +```{r} +nrow(gapminder) +ncol(gapminder) +``` + +Or, both at once: + +```{r} +dim(gapminder) +``` + +We'll also likely want to know what the titles of all the columns are, so we can +ask for them later: + +```{r} +colnames(gapminder) +``` + +At this stage, it's important to ask ourselves if the structure R is reporting +matches our intuition or expectations; do the basic data types reported for each +column make sense? If not, we need to sort any problems out now before they turn +into bad surprises down the road, using what we've learned about how R +interprets data, and the importance of _strict consistency_ in how we record our +data. + +Once we're happy that the data types and structures seem reasonable, it's time +to start digging into our data proper. 
Check out the first few lines: + +```{r} +head(gapminder) +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 2 + +It's good practice to also check the last few lines of your data and some in the middle. How would you do this? + +Searching for ones specifically in the middle isn't too hard, but we could ask for a few lines at random. How would you code this? + +::::::::::::::: solution + +## Solution to Challenge 2 + +To check the last few lines it's relatively simple as R already has a function for this: + +```r +tail(gapminder) +tail(gapminder, n = 15) +``` + +What about a few arbitrary rows just in case something is odd in the middle? + +## Tip: There are several ways to achieve this. + +The solution here presents one form of using nested functions, i.e. a function passed as an argument to another function. This might sound like a new concept, but you are already using it! +Remember my\_dataframe[rows, cols] will print to screen your data frame with the number of rows and columns you asked for (although you might have asked for a range or named columns for example). How would you get the last row if you don't know how many rows your data frame has? R has a function for this. What about getting a (pseudorandom) sample? R also has a function for this. + +```r +gapminder[sample(nrow(gapminder), 5), ] +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +To make sure our analysis is reproducible, we should put the code +into a script file so we can come back to it later. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 3 + +Go to file -> new file -> R script, and write an R script +to load in the gapminder dataset. Put it in the `scripts/` +directory and add it to version control. + +Run the script using the `source` function, using the file path +as its argument (or by pressing the "source" button in RStudio). 
+ +::::::::::::::: solution + +## Solution to Challenge 3 + +The `source` function can be used to use a script within a script. +Assume you would like to load the same type of file over and over +again and therefore you need to specify the arguments to fit the +needs of your file. Instead of writing the necessary argument again +and again you could just write it once and save it as a script. Then, +you can use `source("Your_Script_containing_the_load_function")` in a new +script to use the function of that script without writing everything again. +Check out `?source` to find out more. + +```{r, eval=FALSE} +download.file("https://raw.githubusercontent.com/swcarpentry/r-novice-gapminder/main/episodes/data/gapminder_data.csv", destfile = "data/gapminder_data.csv") +gapminder <- read.csv(file = "data/gapminder_data.csv") +``` + +To run the script and load the data into the `gapminder` variable: + +```{r, eval=FALSE} +source(file = "scripts/load-gapminder.R") +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 4 + +Read the output of `str(gapminder)` again; +this time, use what you've learned about lists and vectors, +as well as the output of functions like `colnames` and `dim` +to explain what everything that `str` prints out for gapminder means. +If there are any parts you can't interpret, discuss with your neighbors! + +::::::::::::::: solution + +## Solution to Challenge 4 + +The object `gapminder` is a data frame with columns + +- `country` and `continent` are character strings. +- `year` is an integer vector. +- `pop`, `lifeExp`, and `gdpPercap` are numeric vectors. + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: keypoints + +- Use `cbind()` to add a new column to a data frame. +- Use `rbind()` to add a new row to a data frame. +- Remove rows from a data frame. 
+- Use `str()`, `summary()`, `nrow()`, `ncol()`, `dim()`, `colnames()`, `head()`, and `typeof()` to understand the structure of a data frame. +- Read in a csv file using `read.csv()`. +- Understand what `length()` of a data frame represents. + +:::::::::::::::::::::::::::::::::::::::::::::::::: diff --git a/locale/uk/episodes/06-data-subsetting.Rmd b/locale/uk/episodes/06-data-subsetting.Rmd new file mode 100644 index 000000000..23242457e --- /dev/null +++ b/locale/uk/episodes/06-data-subsetting.Rmd @@ -0,0 +1,863 @@ +--- +title: Subsetting Data +teaching: 35 +exercises: 15 +source: Rmd +--- + +::::::::::::::::::::::::::::::::::::::: objectives + +- To be able to subset vectors, factors, matrices, lists, and data frames +- To be able to extract individual and multiple elements: by index, by name, using comparison operations +- To be able to skip and remove elements from various data structures. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- How can I work with subsets of data in R? + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +```{r, include=FALSE} +gapminder <- read.csv("data/gapminder_data.csv", header = TRUE) +``` + +R has many powerful subset operators. Mastering them will allow you to +easily perform complex operations on any kind of dataset. + +There are six different ways we can subset any kind of object, and three +different subsetting operators for the different data structures. + +Let's start with the workhorse of R: a simple numeric vector. + +```{r} +x <- c(5.4, 6.2, 7.1, 4.8, 7.5) +names(x) <- c('a', 'b', 'c', 'd', 'e') +x +``` + +::::::::::::::::::::::::::::::::::::::::: callout + +## Atomic vectors + +In R, simple vectors containing character strings, numbers, or logical values are called _atomic_ vectors because they can't be further simplified. 
+ +:::::::::::::::::::::::::::::::::::::::::::::::::: + +So now that we've created a dummy vector to play with, how do we get at its +contents? + +## Accessing elements using their indices + +To extract elements of a vector we can give their corresponding index, starting +from one: + +```{r} +x[1] +``` + +```{r} +x[4] +``` + +It may look different, but the square brackets operator is a function. For vectors +(and matrices), it means "get me the nth element". + +We can ask for multiple elements at once: + +```{r} +x[c(1, 3)] +``` + +Or slices of the vector: + +```{r} +x[1:4] +``` + +the `:` operator creates a sequence of numbers from the left element to the right. + +```{r} +1:4 +c(1, 2, 3, 4) +``` + +We can ask for the same element multiple times: + +```{r} +x[c(1,1,3)] +``` + +If we ask for an index beyond the length of the vector, R will return a missing value: + +```{r} +x[6] +``` + +This is a vector of length one containing an `NA`, whose name is also `NA`. + +If we ask for the 0th element, we get an empty vector: + +```{r} +x[0] +``` + +::::::::::::::::::::::::::::::::::::::::: callout + +## Vector numbering in R starts at 1 + +In many programming languages (C and Python, for example), the first +element of a vector has an index of 0. In R, the first element is 1. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Skipping and removing elements + +If we use a negative number as the index of a vector, R will return +every element _except_ for the one specified: + +```{r} +x[-2] +``` + +We can skip multiple elements: + +```{r} +x[c(-1, -5)] # or x[-c(1,5)] +``` + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: Order of operations + +A common trip up for novices occurs when trying to skip +slices of a vector. It's natural to try to negate a +sequence like so: + +```{r, error=TRUE, eval=FALSE} +x[-1:3] +``` + +This gives a somewhat cryptic error: + +```{r, error=TRUE, echo=FALSE} +x[-1:3] +``` + +But remember the order of operations. 
`:` is really a function. +It takes its first argument as -1, and its second as 3, +so generates the sequence of numbers: `c(-1, 0, 1, 2, 3)`. + +The correct solution is to wrap that function call in brackets, so +that the `-` operator applies to the result: + +```{r} +x[-(1:3)] +``` + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +To remove elements from a vector, we need to assign the result back +into the variable: + +```{r} +x <- x[-4] +x +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 1 + +Given the following code: + +```{r} +x <- c(5.4, 6.2, 7.1, 4.8, 7.5) +names(x) <- c('a', 'b', 'c', 'd', 'e') +print(x) +``` + +Come up with at least 2 different commands that will produce the following output: + +```{r, echo=FALSE} +x[2:4] +``` + +After you find 2 different commands, compare notes with your neighbour. Did you have different strategies? + +::::::::::::::: solution + +## Solution to challenge 1 + +```{r} +x[2:4] +``` + +```{r} +x[-c(1,5)] +``` + +```{r} +x[c(2,3,4)] +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Subsetting by name + +We can extract elements by using their name, instead of extracting by index: + +```{r} +x <- c(a=5.4, b=6.2, c=7.1, d=4.8, e=7.5) # we can name a vector 'on the fly' +x[c("a", "c")] +``` + +This is usually a much more reliable way to subset objects: the +position of various elements can often change when chaining together +subsetting operations, but the names will always remain the same! + +## Subsetting through other logical operations {#logical-operations} + +We can also use any logical vector to subset: + +```{r} +x[c(FALSE, FALSE, TRUE, FALSE, TRUE)] +``` + +Since comparison operators (e.g. `>`, `<`, `==`) evaluate to logical vectors, we can also +use them to succinctly subset vectors: the following statement gives +the same result as the previous one. 
+ +```{r} +x[x > 7] +``` + +Breaking it down, this statement first evaluates `x>7`, generating +a logical vector `c(FALSE, FALSE, TRUE, FALSE, TRUE)`, and then +selects the elements of `x` corresponding to the `TRUE` values. + +We can use `==` to mimic the previous method of indexing by name +(remember you have to use `==` rather than `=` for comparisons): + +```{r} +x[names(x) == "a"] +``` + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: Combining logical conditions + +We often want to combine multiple logical +criteria. For example, we might want to find all the countries that are +located in Asia **or** Europe **and** have life expectancies within a certain +range. Several operations for combining logical vectors exist in R: + +- `&`, the "logical AND" operator: returns `TRUE` if both the left and right + are `TRUE`. +- `|`, the "logical OR" operator: returns `TRUE`, if either the left or right + (or both) are `TRUE`. + +You may sometimes see `&&` and `||` instead of `&` and `|`. These two-character operators +only look at the first element of each vector and ignore the +remaining elements. In general you should not use the two-character +operators in data analysis; save them +for programming, i.e. deciding whether to execute a statement. + +- `!`, the "logical NOT" operator: converts `TRUE` to `FALSE` and `FALSE` to + `TRUE`. It can negate a single logical condition (eg `!TRUE` becomes + `FALSE`), or a whole vector of conditions(eg `!c(TRUE, FALSE)` becomes + `c(FALSE, TRUE)`). + +Additionally, you can compare the elements within a single vector using the +`all` function (which returns `TRUE` if every element of the vector is `TRUE`) +and the `any` function (which returns `TRUE` if one or more elements of the +vector are `TRUE`). 
+ +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 2 + +Given the following code: + +```{r} +x <- c(5.4, 6.2, 7.1, 4.8, 7.5) +names(x) <- c('a', 'b', 'c', 'd', 'e') +print(x) +``` + +Write a subsetting command to return the values in x that are greater than 4 and less than 7. + +::::::::::::::: solution + +## Solution to challenge 2 + +```{r} +x_subset <- x[x<7 & x>4] +print(x_subset) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: Non-unique names + +You should be aware that it is possible for multiple elements in a +vector to have the same name. (For a data frame, columns can have +the same name --- although R tries to avoid this --- but row names +must be unique.) Consider these examples: + +```{r} +x <- 1:3 +x +names(x) <- c('a', 'a', 'a') +x +x['a'] # only returns first value +x[names(x) == 'a'] # returns all three values +``` + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: Getting help for operators + +Remember you can search for help on operators by wrapping them in quotes: +`help("%in%")` or `?"%in%"`. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Skipping named elements + +Skipping or removing named elements is a little harder. If we try to skip one named element by negating the string, R complains (slightly obscurely) that it doesn't know how to take the negative of a string: + +```{r} +x <- c(a=5.4, b=6.2, c=7.1, d=4.8, e=7.5) # we start again by naming a vector 'on the fly' +x[-"a"] +``` + +However, we can use the `!=` (not-equals) operator to construct a logical vector that will do what we want: + +```{r} +x[names(x) != "a"] +``` + +Skipping multiple named indices is a little bit harder still. 
Suppose we want to drop the `"a"` and `"c"` elements, so we try this: + +```{r} +x[names(x)!=c("a","c")] +``` + +R did _something_, but it gave us a warning that we ought to pay attention to - and it apparently _gave us the wrong answer_ (the `"c"` element is still included in the vector)! + +So what does `!=` actually do in this case? That's an excellent question. + +### Recycling + +Let's take a look at the comparison component of this code: + +```{r} +names(x) != c("a", "c") +``` + +Why does R give `TRUE` as the third element of this vector, when `names(x)[3] != "c"` is obviously false? +When you use `!=`, R tries to compare each element +of the left argument with the corresponding element of its right +argument. What happens when you compare vectors of different lengths? + +![](fig/06-rmd-inequality.1.png){alt='Inequality testing'} + +When one vector is shorter than the other, it gets _recycled_: + +![](fig/06-rmd-inequality.2.png){alt='Inequality testing: results of recycling'} + +In this case R **repeats** `c("a", "c")` as many times as necessary to match `names(x)`, i.e. we get `c("a","c","a","c","a")`. Since the recycled `"a"` +doesn't match the third element of `names(x)`, the value of `!=` is `TRUE`. +Because in this case the longer vector length (5) isn't a multiple of the shorter vector length (2), R printed a warning message. If we had been unlucky and `names(x)` had contained six elements, R would _silently_ have done the wrong thing (i.e., not what we intended it to do). This recycling rule can introduce hard-to-find and subtle bugs! + +The way to get R to do what we really want (match _each_ element of the left argument with _all_ of the elements of the right argument) is to use the `%in%` operator. The `%in%` operator goes through each element of its left argument, in this case the names of `x`, and asks, "Does this element occur in the second argument?".
Here, since we want to _exclude_ values, we also need a `!` operator to change "in" to "not in": + +```{r} +x[! names(x) %in% c("a","c") ] +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 3 + +Selecting elements of a vector that match any of a list of components +is a very common data analysis task. For example, the gapminder data set +contains `country` and `continent` variables, but no information between +these two scales. Suppose we want to pull out information from southeast +Asia: how do we set up an operation to produce a logical vector that +is `TRUE` for all of the countries in southeast Asia and `FALSE` otherwise? + +Suppose you have these data: + +```{r} +seAsia <- c("Myanmar","Thailand","Cambodia","Vietnam","Laos") +## read in the gapminder data that we downloaded in episode 2 +gapminder <- read.csv("data/gapminder_data.csv", header=TRUE) +## extract the `country` column from a data frame (we'll see this later); +## convert from a factor to a character; +## and get just the non-repeated elements +countries <- unique(as.character(gapminder$country)) +``` + +There's a wrong way (using only `==`), which will give you a warning; +a clunky way (using the logical operators `==` and `|`); and +an elegant way (using `%in%`). See whether you can come up with all three +and explain how they (don't) work. + +::::::::::::::: solution + +## Solution to challenge 3 + +- The **wrong** way to do this problem is `countries==seAsia`. This + gives a warning (`"In countries == seAsia : longer object length is not a multiple of shorter object length"`) and the wrong answer (a vector of all + `FALSE` values), because none of the recycled values of `seAsia` happen + to line up correctly with matching values in `country`. 
+- The **clunky** (but technically correct) way to do this problem is + +```{r, results="hide"} + (countries=="Myanmar" | countries=="Thailand" | + countries=="Cambodia" | countries == "Vietnam" | countries=="Laos") +``` + +(or `countries==seAsia[1] | countries==seAsia[2] | ...`). This +gives the correct values, but hopefully you can see how awkward it +is (what if we wanted to select countries from a much longer list?). + +- The best way to do this problem is `countries %in% seAsia`, which + is both correct and easy to type (and read). + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Handling special values + +At some point you will encounter functions in R that cannot handle missing, infinite, +or undefined data. + +There are a number of special functions you can use to filter out this data: + +- `is.na` will return all positions in a vector, matrix, or data.frame + containing `NA` (or `NaN`) +- likewise, `is.nan`, and `is.infinite` will do the same for `NaN` and `Inf`. +- `is.finite` will return all positions in a vector, matrix, or data.frame + that do not contain `NA`, `NaN` or `Inf`. +- `na.omit` will filter out all missing values from a vector + +## Factor subsetting + +Now that we've explored the different ways to subset vectors, how +do we subset the other data structures? + +Factor subsetting works the same way as vector subsetting. + +```{r} +f <- factor(c("a", "a", "b", "c", "c", "d")) +f[f == "a"] +f[f %in% c("b", "c")] +f[1:3] +``` + +Skipping elements will not remove the level +even if no more of that category exists in the factor: + +```{r} +f[-3] +``` + +## Matrix subsetting + +Matrices are also subsetted using the `[` function. 
In this case +it takes two arguments: the first applying to the rows, the second +to the columns: + +```{r} +set.seed(1) +m <- matrix(rnorm(6*4), ncol=4, nrow=6) +m[3:4, c(3,1)] +``` + +You can leave the first or second arguments blank to retrieve all the +rows or columns respectively: + +```{r} +m[, c(3,4)] +``` + +If we only access one row or column, R will automatically convert the result +to a vector: + +```{r} +m[3,] +``` + +If you want to keep the output as a matrix, you need to specify a _third_ argument, +`drop = FALSE`: + +```{r} +m[3, , drop=FALSE] +``` + +Unlike vectors, if we try to access a row or column outside of the matrix, +R will throw an error: + +```{r, error=TRUE} +m[, c(3,6)] +``` + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: Higher dimensional arrays + +When dealing with multi-dimensional arrays, each argument to `[` +corresponds to a dimension. For example, in a 3D array, the first three +arguments correspond to the rows, columns, and depth dimensions. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +Because matrices are vectors, we can +also subset using only one argument: + +```{r} +m[5] +``` + +This usually isn't useful, and is often confusing to read. However, it is worth noting that matrices +are laid out in _column-major format_ by default. That is, the elements of the +vector are arranged column-wise: + +```{r} +matrix(1:6, nrow=2, ncol=3) +``` + +If you wish to populate the matrix by row, use `byrow=TRUE`: + +```{r} +matrix(1:6, nrow=2, ncol=3, byrow=TRUE) +``` + +Matrices can also be subsetted using their rownames and column names +instead of their row and column indices. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 4 + +Given the following code: + +```{r} +m <- matrix(1:18, nrow=3, ncol=6) +print(m) +``` + +1. Which of the following commands will extract the values 11 and 14? + +A. `m[2,4,2,5]` + +B. `m[2:5]` + +C. `m[4:5,2]` + +D.
`m[2,c(4,5)]` + +::::::::::::::: solution + +## Solution to challenge 4 + +D + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## List subsetting + +Now we'll introduce some new subsetting operators. There are three functions +used to subset lists. We've already seen these when learning about atomic vectors and matrices: `[`, `[[`, and `$`. + +Using `[` will always return a list. If you want to _subset_ a list, but not +_extract_ an element, then you will likely use `[`. + +```{r} +xlist <- list(a = "Software Carpentry", b = 1:10, data = head(mtcars)) +xlist[1] +``` + +This returns a _list with one element_. + +We can subset elements of a list exactly the same way as atomic +vectors using `[`. Comparison operations however won't work as +they're not recursive, they will try to condition on the data structures +in each element of the list, not the individual elements within those +data structures. + +```{r} +xlist[1:2] +``` + +To extract individual elements of a list, you need to use the double-square +bracket function: `[[`. + +```{r} +xlist[[1]] +``` + +Notice that now the result is a vector, not a list. + +You can't extract more than one element at once: + +```{r, error=TRUE} +xlist[[1:2]] +``` + +Nor use it to skip elements: + +```{r, error=TRUE} +xlist[[-1]] +``` + +But you can use names to both subset and extract elements: + +```{r} +xlist[["a"]] +``` + +The `$` function is a shorthand way for extracting elements by name: + +```{r} +xlist$data +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 5 + +Given the following list: + +```{r, eval=FALSE} +xlist <- list(a = "Software Carpentry", b = 1:10, data = head(mtcars)) +``` + +Using your knowledge of both list and vector subsetting, extract the number 2 from xlist. +Hint: the number 2 is contained within the "b" item in the list. 
+ +::::::::::::::: solution + +## Solution to challenge 5 + +```{r} +xlist$b[2] +``` + +```{r} +xlist[[2]][2] +``` + +```{r} +xlist[["b"]][2] +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 6 + +Given a linear model: + +```{r, eval=FALSE} +mod <- aov(pop ~ lifeExp, data=gapminder) +``` + +Extract the residual degrees of freedom (hint: `attributes()` will help you) + +::::::::::::::: solution + +## Solution to challenge 6 + +```{r, eval=FALSE} +attributes(mod) ## `df.residual` is one of the names of `mod` +``` + +```{r, eval=FALSE} +mod$df.residual +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Data frames + +Remember the data frames are lists underneath the hood, so similar rules +apply. However they are also two dimensional objects: + +`[` with one argument will act the same way as for lists, where each list +element corresponds to a column. The resulting object will be a data frame: + +```{r} +head(gapminder[3]) +``` + +Similarly, `[[` will act to extract _a single column_: + +```{r} +head(gapminder[["lifeExp"]]) +``` + +And `$` provides a convenient shorthand to extract columns by name: + +```{r} +head(gapminder$year) +``` + +With two arguments, `[` behaves the same way as for matrices: + +```{r} +gapminder[1:3,] +``` + +If we subset a single row, the result will be a data frame (because +the elements are mixed types): + +```{r} +gapminder[3,] +``` + +But for a single column the result will be a vector (this can +be changed with the third argument, `drop = FALSE`). + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 7 + +Fix each of the following common data frame subsetting errors: + +1. Extract observations collected for the year 1957 + +```{r, eval=FALSE} +gapminder[gapminder$year = 1957,] +``` + +2. 
Extract all columns except 1 through to 4 + +```{r, eval=FALSE} +gapminder[,-1:4] +``` + +3. Extract the rows where the life expectancy is longer than 80 years + +```{r, eval=FALSE} +gapminder[gapminder$lifeExp > 80] +``` + +4. Extract the first row, and the fourth and fifth columns + (`continent` and `lifeExp`). + +```{r, eval=FALSE} +gapminder[1, 4, 5] +``` + +5. Advanced: extract rows that contain information for the years 2002 + and 2007 + +```{r, eval=FALSE} +gapminder[gapminder$year == 2002 | 2007,] +``` + +::::::::::::::: solution + +## Solution to challenge 7 + +Fix each of the following common data frame subsetting errors: + +1. Extract observations collected for the year 1957 + +```{r, eval=FALSE} +# gapminder[gapminder$year = 1957,] +gapminder[gapminder$year == 1957,] +``` + +2. Extract all columns except 1 through to 4 + +```{r, eval=FALSE} +# gapminder[,-1:4] +gapminder[,-c(1:4)] +``` + +3. Extract the rows where the life expectancy is longer than 80 years + +```{r, eval=FALSE} +# gapminder[gapminder$lifeExp > 80] +gapminder[gapminder$lifeExp > 80,] +``` + +4. Extract the first row, and the fourth and fifth columns + (`continent` and `lifeExp`). + +```{r, eval=FALSE} +# gapminder[1, 4, 5] +gapminder[1, c(4, 5)] +``` + +5. Advanced: extract rows that contain information for the years 2002 + and 2007 + +```{r, eval=FALSE} +# gapminder[gapminder$year == 2002 | 2007,] +gapminder[gapminder$year == 2002 | gapminder$year == 2007,] +gapminder[gapminder$year %in% c(2002, 2007),] +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 8 + +1. Why does `gapminder[1:20]` return an error? How does it differ from `gapminder[1:20, ]`? + +2. Create a new `data.frame` called `gapminder_small` that only contains rows 1 through 9 + and 19 through 23. You can do this in one or two steps. + +::::::::::::::: solution + +## Solution to challenge 8 + +1.
`gapminder` is a data.frame so needs to be subsetted on two dimensions. `gapminder[1:20, ]` subsets the data to give the first 20 rows and all columns. + +2. + +```{r} +gapminder_small <- gapminder[c(1:9, 19:23),] +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: keypoints + +- Indexing in R starts at 1, not 0. +- Access individual values by location using `[]`. +- Access slices of data using `[low:high]`. +- Access arbitrary sets of data using `[c(...)]`. +- Use logical operations and logical vectors to access subsets of data. + +:::::::::::::::::::::::::::::::::::::::::::::::::: diff --git a/locale/uk/episodes/07-control-flow.Rmd b/locale/uk/episodes/07-control-flow.Rmd new file mode 100644 index 000000000..39946a2c4 --- /dev/null +++ b/locale/uk/episodes/07-control-flow.Rmd @@ -0,0 +1,565 @@ +--- +title: Control Flow +teaching: 45 +exercises: 20 +source: Rmd +--- + +::::::::::::::::::::::::::::::::::::::: objectives + +- Write conditional statements with `if...else` statements and `ifelse()`. +- Write and understand `for()` loops. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- How can I make data-dependent choices in R? +- How can I repeat operations in R? + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +```{r, include=FALSE} +gapminder <- read.csv("data/gapminder_data.csv", header = TRUE) +set.seed(10) +``` + +Often when we're coding we want to control the flow of our actions. This can be done +by setting actions to occur only if a condition or a set of conditions are met. +Alternatively, we can also set an action to occur a particular number of times. + +There are several ways you can control flow in R. +For conditional statements, the most commonly used approaches are the constructs: + +```{r, eval=FALSE} +# if +if (condition is true) { + perform action +} + +# if ... 
else +if (condition is true) { + perform action +} else { # that is, if the condition is false, + perform alternative action +} +``` + +Say, for example, that we want R to print a message if a variable `x` has a particular value: + +```{r} +x <- 8 + +if (x >= 10) { + print("x is greater than or equal to 10") +} + +x +``` + +The print statement does not appear in the console because x is not greater than 10. To print a different message for numbers less than 10, we can add an `else` statement. + +```{r} +x <- 8 + +if (x >= 10) { + print("x is greater than or equal to 10") +} else { + print("x is less than 10") +} +``` + +You can also test multiple conditions by using `else if`. + +```{r} +x <- 8 + +if (x >= 10) { + print("x is greater than or equal to 10") +} else if (x > 5) { + print("x is greater than 5, but less than 10") +} else { + print("x is less than 5") +} +``` + +**Important:** when R evaluates the condition inside `if()` statements, it is +looking for a logical element, i.e., `TRUE` or `FALSE`. This can cause some +headaches for beginners. For example: + +```{r} +x <- 4 == 3 +if (x) { + "4 equals 3" +} else { + "4 does not equal 3" +} +``` + +As we can see, the not equal message was printed because the vector x is `FALSE` + +```{r} +x <- 4 == 3 +x +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 1 + +Use an `if()` statement to print a suitable message +reporting whether there are any records from 2002 in +the `gapminder` dataset. +Now do the same for 2012. + +::::::::::::::: solution + +## Solution to Challenge 1 + +We will first see a solution to Challenge 1 which does not use the `any()` function. 
+We first use a logical vector, marking which elements of `gapminder$year` are equal to `2002`, to select the matching rows: + +```{r ch10pt1-sol, eval=FALSE} +gapminder[(gapminder$year == 2002),] +``` + +Then, we count the number of rows of the data.frame `gapminder` that correspond to the year 2002: + +```{r ch10pt2-sol, eval=FALSE} +rows2002_number <- nrow(gapminder[(gapminder$year == 2002),]) +``` + +The presence of any record for the year 2002 is equivalent to requiring that `rows2002_number` is one or more: + +```{r ch10pt3-sol, eval=FALSE} +rows2002_number >= 1 +``` + +Putting it all together, we obtain: + +```{r ch10pt4-sol, eval=FALSE} +if(nrow(gapminder[(gapminder$year == 2002),]) >= 1){ + print("Record(s) for the year 2002 found.") +} +``` + +All this can be done more quickly with `any()`. The logical condition can be expressed as: + +```{r ch10pt5-sol, eval=FALSE} +if(any(gapminder$year == 2002)){ + print("Record(s) for the year 2002 found.") +} +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +Did anyone get an error or warning message like this? + +```{r, echo=FALSE} +if (gapminder$year == 2012) {} +``` + +The `if()` function only accepts singular (of length 1) inputs. Since R 4.2.0, it +returns an error when you use it with a longer vector. Older versions of R +still ran, with a warning, but only evaluated the condition in the first element of the vector. +Therefore, to use the `if()` function, you need to make sure your input is +singular (of length 1). + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: Built in `ifelse()` function + +`R` accepts both `if()` and `else if()` statements structured as outlined above, +but also statements using `R`'s built-in `ifelse()` function.
This +function accepts both singular and vector inputs and is structured as +follows: + +```{r, eval=FALSE} +# ifelse function +ifelse(condition is true, perform action, perform alternative action) + +``` + +where the first argument is the condition or a set of conditions to be met, the +second argument is the statement that is evaluated when the condition is `TRUE`, +and the third statement is the statement that is evaluated when the condition +is `FALSE`. + +```{r} +y <- -3 +ifelse(y < 0, "y is a negative number", "y is either positive or zero") + +``` + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: `any()` and `all()` + +The `any()` function will return `TRUE` if at least one +`TRUE` value is found within a vector, otherwise it will return `FALSE`. +This can be used in a similar way to the `%in%` operator. +The function `all()`, as the name suggests, will only return `TRUE` if all values in +the vector are `TRUE`. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Repeating operations + +If you want to iterate over +a set of values, when the order of iteration is important, and perform the +same operation on each, a `for()` loop will do the job. +We saw `for()` loops in the [shell lessons earlier](https://swcarpentry.github.io/shell-novice/05-loop.html). This is the most +flexible of looping operations, but therefore also the hardest to use +correctly. In general, the advice of many `R` users would be to learn about +`for()` loops, but to avoid using `for()` loops unless the order of iteration is +important: i.e. the calculation at each iteration depends on the results of +previous iterations. If the order of iteration is not important, then you +should learn about vectorized alternatives, such as the `purrr` package, as they +pay off in computational efficiency. 
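The distinction above can be sketched with a small example. This is only an illustration, not part of the original lesson: the variable names (`running_total`, `values`, `squares_vectorized`, `squares_vapply`) are made up here, and base R's `vapply()` stands in for the `purrr` functions mentioned above so that no extra package is needed.

```r
# Order matters: each value depends on the one computed just before it,
# so a for() loop is a natural fit.
running_total <- numeric(5)   # pre-allocate the result
running_total[1] <- 1
for (i in 2:5) {
  running_total[i] <- running_total[i - 1] + i
}

# Order does not matter: every element is handled independently,
# so a vectorized expression (or vapply()) can replace the loop.
values <- 1:5
squares_vectorized <- values^2
squares_vapply <- vapply(values, function(v) v^2, numeric(1))
```

In the first half, swapping the order of the iterations would change the answer; in the second half it would not, which is the sign that a vectorized form is available.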
+ +The basic structure of a `for()` loop is: + +```{r, eval=FALSE} +for (iterator in set of values) { + do a thing +} +``` + +For example: + +```{r} +for (i in 1:10) { + print(i) +} +``` + +The `1:10` bit creates a vector on the fly; you can iterate +over any other vector as well. + +We can use a `for()` loop nested within another `for()` loop to iterate over two things at +once. + +```{r} +for (i in 1:5) { + for (j in c('a', 'b', 'c', 'd', 'e')) { + print(paste(i,j)) + } +} +``` + +We notice in the output that when the first index (`i`) is set to 1, the second +index (`j`) iterates through its full set of indices. Once the indices of `j` +have been iterated through, then `i` is incremented. This process continues +until the last index has been used for each `for()` loop. + +Rather than printing the results, we could write the loop output to a new object. + +```{r} +output_vector <- c() +for (i in 1:5) { + for (j in c('a', 'b', 'c', 'd', 'e')) { + temp_output <- paste(i, j) + output_vector <- c(output_vector, temp_output) + } +} +output_vector +``` + +This approach can be useful, but 'growing your results' (building +the result object incrementally) is computationally inefficient, so avoid +it when you are iterating through a lot of values. + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: don't grow your results + +One of the biggest things that trips up novices and +experienced R users alike is building a results object +(vector, list, matrix, data frame) as your for loop progresses. +Computers are very bad at handling this, so your calculations +can very quickly slow to a crawl. It's much better to define +an empty results object of the appropriate dimensions beforehand, rather +than initializing an empty object without dimensions. +So if you know the end result will be stored in a matrix like above, +create an empty matrix with 5 rows and 5 columns, then at each iteration +store the results in the appropriate location.
+ +:::::::::::::::::::::::::::::::::::::::::::::::::: + +A better way is to define your (empty) output object before filling in the values. +For this example, it looks more involved, but is still more efficient. + +```{r} +output_matrix <- matrix(nrow = 5, ncol = 5) +j_vector <- c('a', 'b', 'c', 'd', 'e') +for (i in 1:5) { + for (j in 1:5) { + temp_j_value <- j_vector[j] + temp_output <- paste(i, temp_j_value) + output_matrix[i, j] <- temp_output + } +} +output_vector2 <- as.vector(output_matrix) +output_vector2 +``` + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: While loops + +Sometimes you will find yourself needing to repeat an operation as long as a certain +condition is met. You can do this with a `while()` loop. + +```{r, eval=FALSE} +while(this condition is true){ + do a thing +} +``` + +R will interpret a condition being met as "TRUE". + +As an example, here's a while loop +that generates random numbers from a uniform distribution (the `runif()` function) +between 0 and 1 until it gets one that's less than 0.1. + +```r +z <- 1 +while(z > 0.1){ + z <- runif(1) + cat(z, "\n") +} +``` + +`while()` loops will not always be appropriate. You have to be particularly careful +that you don't end up stuck in an infinite loop because your condition is always met and hence the while statement never terminates. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 2 + +Compare the objects `output_vector` and +`output_vector2`. Are they the same? If not, why not? +How would you change the last block of code to make `output_vector2` +the same as `output_vector`? 
+ +::::::::::::::: solution + +## Solution to Challenge 2 + +We can check whether the two vectors are identical using the `all()` function: + +```{r ch10pt6-sol, eval=FALSE} +all(output_vector == output_vector2) +``` + +This returns `FALSE`, so the two vectors are not identical. +However, all the elements of `output_vector` can be found in `output_vector2`: + +```{r ch10pt7-sol, eval=FALSE} +all(output_vector %in% output_vector2) +``` + +and vice versa: + +```{r ch10pt8-sol, eval=FALSE} +all(output_vector2 %in% output_vector) +``` + +Therefore, the elements in `output_vector` and `output_vector2` are the same, just in a different order. +This is because `as.vector()` outputs the elements of an input matrix column by column. +Taking a look at `output_matrix`, we can notice that we want its elements by rows. +The solution is to transpose the `output_matrix`. We can do it either by calling the transpose function +`t()` or by inputting the elements in the right order. +The first solution requires changing the original + +```{r ch10pt9-sol, eval=FALSE} +output_vector2 <- as.vector(output_matrix) +``` + +into + +```{r ch10pt10-sol, eval=FALSE} +output_vector2 <- as.vector(t(output_matrix)) +``` + +The second solution requires changing + +```{r ch10pt11-sol, eval=FALSE} +output_matrix[i, j] <- temp_output +``` + +into + +```{r ch10pt12-sol, eval=FALSE} +output_matrix[j, i] <- temp_output +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 3 + +Write a script that loops through the `gapminder` data by continent and prints out +whether the mean life expectancy is smaller or larger than 50 +years. 
+ +::::::::::::::: solution + +## Solution to Challenge 3 + +**Step 1**: We want to make sure we can extract all the unique values of the continent vector + +```{r 07-chall-03-sol-a, eval=FALSE} +gapminder <- read.csv("data/gapminder_data.csv") +unique(gapminder$continent) +``` + +**Step 2**: We also need to loop over each of these continents and calculate the average life expectancy for each `subset` of data. +We can do that as follows: + +1. Loop over each of the unique values of 'continent' +2. For each value of continent, create a temporary variable storing that subset +3. Return the calculated life expectancy to the user by printing the output: + +```{r 07-chall-03-sol-b, eval=FALSE} +for (iContinent in unique(gapminder$continent)) { + tmp <- gapminder[gapminder$continent == iContinent, ] + cat(iContinent, mean(tmp$lifeExp, na.rm = TRUE), "\n") + rm(tmp) +} +``` + +**Step 3**: The exercise only wants the output printed if the average life expectancy is less than 50 or greater than 50. +So we need to add an `if()` condition before printing, which evaluates whether the calculated average life expectancy is above or below a threshold, and prints an output conditional on the result. +We need to amend (3) from above: + +3a. 
If the calculated life expectancy is less than some threshold (50 years), return the continent and a statement that life expectancy is less than threshold, otherwise return the continent and a statement that life expectancy is greater than threshold: + +```{r 07-chall-03-sol-c, eval=FALSE} +thresholdValue <- 50 + +for (iContinent in unique(gapminder$continent)) { + tmp <- mean(gapminder[gapminder$continent == iContinent, "lifeExp"]) + + if (tmp < thresholdValue){ + cat("Average Life Expectancy in", iContinent, "is less than", thresholdValue, "\n") + } else { + cat("Average Life Expectancy in", iContinent, "is greater than", thresholdValue, "\n") + } # end if else condition + rm(tmp) +} # end for loop + +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 4 + +Modify the script from Challenge 3 to loop over each +country. This time print out whether the life expectancy is +smaller than 50, between 50 and 70, or greater than 70. 
+ +::::::::::::::: solution + +## Solution to Challenge 4 + +We modify our solution to Challenge 3 by now adding two thresholds, `lowerThreshold` and `upperThreshold`, and extending our if-else statements: + +```{r 07-chall-04-sol, eval=FALSE} + lowerThreshold <- 50 + upperThreshold <- 70 + +for (iCountry in unique(gapminder$country)) { + tmp <- mean(gapminder[gapminder$country == iCountry, "lifeExp"]) + + if(tmp < lowerThreshold) { + cat("Average Life Expectancy in", iCountry, "is less than", lowerThreshold, "\n") + } else if(tmp >= lowerThreshold && tmp <= upperThreshold) { + cat("Average Life Expectancy in", iCountry, "is between", lowerThreshold, "and", upperThreshold, "\n") + } else { + cat("Average Life Expectancy in", iCountry, "is greater than", upperThreshold, "\n") + } + rm(tmp) +} +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 5 - Advanced + +Write a script that loops over each country in the `gapminder` dataset, +tests whether the country starts with a 'B', and graphs life expectancy +against time as a line graph if the mean life expectancy is under 50 years. + +::::::::::::::: solution + +## Solution to Challenge 5 + +We will use the `grep()` command that was introduced in the [Unix Shell lesson](https://swcarpentry.github.io/shell-novice/07-find.html) +to find countries that start with "B". +Let's understand how to do this first. +Following from the Unix shell section we may be tempted to try the following: + +```{r 07-chall-05-sol-a, eval=FALSE} +grep("^B", unique(gapminder$country)) +``` + +But when we evaluate this command it returns the indices of the elements of `unique(gapminder$country)` that start with "B". 
+To get the values, we must add the `value=TRUE` option to the `grep()` command: + +```{r 07-chall-05-sol-b, eval=FALSE} +grep("^B", unique(gapminder$country), value = TRUE) +``` + +We will now store these countries in a variable called candidateCountries, and then loop over each entry in the variable. +Inside the loop, we evaluate the average life expectancy for each country, and if the average life expectancy is less than 50 we use base-plot to plot the evolution of average life expectancy using `with()` and `subset()`: + +```{r 07-chall-05-sol-c, eval=FALSE} +thresholdValue <- 50 +candidateCountries <- grep("^B", unique(gapminder$country), value = TRUE) + +for (iCountry in candidateCountries) { + tmp <- mean(gapminder[gapminder$country == iCountry, "lifeExp"]) + + if (tmp < thresholdValue) { + cat("Average Life Expectancy in", iCountry, "is less than", thresholdValue, "plotting life expectancy graph... \n") + + with(subset(gapminder, country == iCountry), + plot(year, lifeExp, + type = "o", + main = paste("Life Expectancy in", iCountry, "over time"), + ylab = "Life Expectancy", + xlab = "Year" + ) # end plot + ) # end with + } # end if + rm(tmp) +} # end for loop +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: keypoints + +- Use `if` and `else` to make choices. +- Use `for` to repeat operations. + +:::::::::::::::::::::::::::::::::::::::::::::::::: diff --git a/locale/uk/episodes/08-plot-ggplot2.Rmd b/locale/uk/episodes/08-plot-ggplot2.Rmd new file mode 100644 index 000000000..12998dc14 --- /dev/null +++ b/locale/uk/episodes/08-plot-ggplot2.Rmd @@ -0,0 +1,471 @@ +--- +title: Creating Publication-Quality Graphics with ggplot2 +teaching: 60 +exercises: 20 +source: Rmd +--- + +::::::::::::::::::::::::::::::::::::::: objectives + +- To be able to use ggplot2 to generate publication-quality graphics. +- To apply geometry, aesthetic, and statistics layers to a ggplot plot. 
+- To manipulate the aesthetics of a plot using different colors, shapes, and lines. +- To improve data visualization through transforming scales and paneling by group. +- To save a plot created with ggplot to disk. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- How can I create publication-quality graphics in R? + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +```{r, include=FALSE} +gapminder <- read.csv("data/gapminder_data.csv", header = TRUE) +``` + +Plotting our data is one of the best ways to +quickly explore it and the various relationships +between variables. + +There are three main plotting systems in R, +the [base plotting system][base], the [lattice] +package, and the [ggplot2] package. + +Today we'll be learning about the ggplot2 package, because +it is the most effective for creating publication-quality +graphics. + +ggplot2 is built on the grammar of graphics, the idea that any plot can be +built from the same set of components: a **data set**, +**mapping aesthetics**, and graphical **layers**: + +- **Data sets** are the data that you, the user, provide. + +- **Mapping aesthetics** are what connect the data to the graphics. + They tell ggplot2 how to use your data to affect how the graph looks, + such as changing what is plotted on the X or Y axis, or the size or + color of different data points. + +- **Layers** are the actual graphical output from ggplot2. Layers + determine what kinds of plot are shown (scatterplot, histogram, etc.), + the coordinate system used (rectangular, polar, others), and other + important aspects of the plot. The idea of layers of graphics may + be familiar to you if you have used image editing programs + like Photoshop, Illustrator, or Inkscape. + +Let's start off building an example using the gapminder data from earlier. +The most basic function is `ggplot`, which lets R know that we're +creating a new plot. 
Any of the arguments we give the `ggplot` +function are the _global_ options for the plot: they apply to all +layers on the plot. + +```{r blank-ggplot, message=FALSE, fig.alt="Blank plot, before adding any mapping aesthetics to ggplot()."} +library("ggplot2") +ggplot(data = gapminder) +``` + +Here we called `ggplot` and told it what data we want to show on +our figure. This is not enough information for `ggplot` to actually +draw anything. It only creates a blank slate for other elements +to be added to. + +Now we're going to add in the **mapping aesthetics** using the +`aes` function. `aes` tells `ggplot` how variables in the **data** +map to _aesthetic_ properties of the figure, such as which columns +of the data should be used for the **x** and **y** locations. + +```{r ggplot-with-aes, message=FALSE, fig.alt="Plotting area with axes for a scatter plot of life expectancy vs GDP, with no data points visible."} +ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +``` + +Here we told `ggplot` we want to plot the "gdpPercap" column of the +gapminder data frame on the x-axis, and the "lifeExp" column on the +y-axis. Notice that we didn't need to explicitly pass `aes` these +columns (e.g. `x = gapminder[, "gdpPercap"]`), this is because +`ggplot` is smart enough to know to look in the **data** for that column! + +The final part of making our plot is to tell `ggplot` how we want to +visually represent the data. We do this by adding a new **layer** +to the plot using one of the **geom** functions. + +```{r lifeExp-vs-gdpPercap-scatter, message=FALSE, fig.alt="Scatter plot of life expectancy vs GDP per capita, now showing the data points."} +ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) + + geom_point() +``` + +Here we used `geom_point`, which tells `ggplot` we want to visually +represent the relationship between **x** and **y** as a scatterplot of points. 
+ +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 1 + +Modify the example so that the figure shows how life expectancy has +changed over time: + +```{r, eval=FALSE} +ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) + geom_point() +``` + +Hint: the gapminder dataset has a column called "year", which should appear +on the x-axis. + +::::::::::::::: solution + +## Solution to challenge 1 + +Here is one possible solution: + +```{r ch1-sol, fig.cap="Binned scatterplot of life expectancy versus year showing how life expectancy has increased over time"} +ggplot(data = gapminder, mapping = aes(x = year, y = lifeExp)) + geom_point() +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 2 + +In the previous examples and challenge we've used the `aes` function to tell +the scatterplot **geom** about the **x** and **y** locations of each point. +Another _aesthetic_ property we can modify is the point _color_. Modify the +code from the previous challenge to **color** the points by the "continent" +column. What trends do you see in the data? Are they what you expected? + +::::::::::::::: solution + +## Solution to challenge 2 + +The solution presented below adds `color=continent` to the call of the `aes` +function. The general trend seems to indicate an increased life expectancy +over the years. On continents with stronger economies we find a longer life +expectancy. + +```{r ch2-sol, fig.cap="Binned scatterplot of life expectancy vs year with color-coded continents showing value of 'aes' function"} +ggplot(data = gapminder, mapping = aes(x = year, y = lifeExp, color=continent)) + + geom_point() +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Layers + +Using a scatterplot probably isn't the best for visualizing change over time. 
+Instead, let's tell `ggplot` to visualize the data as a line plot: + +```{r lifeExp-line} +ggplot(data = gapminder, mapping = aes(x=year, y=lifeExp, color=continent)) + + geom_line() +``` + +Instead of adding a `geom_point` layer, we've added a `geom_line` layer. + +However, the result doesn't look quite as we might have expected: it seems to be jumping around a lot in each continent. Let's try to separate the data by country, plotting one line for each country: + +```{r lifeExp-line-by} +ggplot(data = gapminder, mapping = aes(x=year, y=lifeExp, group=country, color=continent)) + + geom_line() +``` + +We've added the **group** _aesthetic_, which tells `ggplot` to draw a line for each +country. + +But what if we want to visualize both lines and points on the plot? We can +add another layer to the plot: + +```{r lifeExp-line-point} +ggplot(data = gapminder, mapping = aes(x=year, y=lifeExp, group=country, color=continent)) + + geom_line() + geom_point() +``` + +It's important to note that each layer is drawn on top of the previous layer. In +this example, the points have been drawn _on top of_ the lines. Here's a +demonstration: + +```{r lifeExp-layer-example-1} +ggplot(data = gapminder, mapping = aes(x=year, y=lifeExp, group=country)) + + geom_line(mapping = aes(color=continent)) + geom_point() +``` + +In this example, the _aesthetic_ mapping of **color** has been moved from the +global plot options in `ggplot` to the `geom_line` layer so it no longer applies +to the points. Now we can clearly see that the points are drawn on top of the +lines. + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: Setting an aesthetic to a value instead of a mapping + +So far, we've seen how to use an aesthetic (such as **color**) as a _mapping_ to a variable in the data. For example, when we use `geom_line(mapping = aes(color=continent))`, ggplot will give a different color to each continent. But what if we want to change the color of all lines to blue? 
You may think that `geom_line(mapping = aes(color="blue"))` should work, but it doesn't: inside `aes()`, ggplot treats "blue" as a data value to map rather than as the color to draw with. Since we don't want to create a mapping to a specific variable, we can move the color specification outside of the `aes()` function, like this: `geom_line(color="blue")`. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 3 + +Switch the order of the point and line layers from the previous example. What +happened? + +::::::::::::::: solution + +## Solution to challenge 3 + +The lines now get drawn over the points! + +```{r ch3-sol, fig.alt="Plot of life expectancy vs year with one line per country, colored by continent, drawn over the data points."} +ggplot(data = gapminder, mapping = aes(x=year, y=lifeExp, group=country)) + + geom_point() + geom_line(mapping = aes(color=continent)) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Transformations and statistics + +ggplot2 also makes it easy to overlay statistical models over the data. To +demonstrate we'll go back to our first example: + +```{r lifeExp-vs-gdpPercap-scatter3, message=FALSE} +ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) + + geom_point() +``` + +Currently it's hard to see the relationship between the points due to some strong +outliers in GDP per capita. We can change the scale of units on the x axis using +the _scale_ functions. These control the mapping between the data values and +visual values of an aesthetic. We can also modify the transparency of the +points, using the _alpha_ argument, which is especially helpful when you have +a large amount of data which is very clustered. 
+ +```{r axis-scale, fig.cap="Scatterplot of GDP vs life expectancy showing logarithmic x-axis data spread"} +ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) + + geom_point(alpha = 0.5) + scale_x_log10() +``` + +The `scale_x_log10` function applied a transformation to the coordinate system of the plot, so that each multiple of 10 is evenly spaced from left to right. For example, a GDP per capita of 1,000 is the same horizontal distance away from a value of 10,000 as the 10,000 value is from 100,000. This helps to visualize the spread of the data along the x-axis. + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip Reminder: Setting an aesthetic to a value instead of a mapping + +Notice that we used `geom_point(alpha = 0.5)`. As the previous tip mentioned, using a setting outside of the `aes()` function will cause this value to be used for all points, which is what we want in this case. But just like any other aesthetic setting, _alpha_ can also be mapped to a variable in the data. For example, we can give a different transparency to each continent with `geom_point(mapping = aes(alpha = continent))`. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +We can fit a simple relationship to the data by adding another layer, +`geom_smooth`: + +```{r lm-fit, fig.alt="Scatter plot of life expectancy vs GDP per capita with a blue trend line summarising the relationship between variables, and gray shaded area indicating 95% confidence intervals for that trend line."} +ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) + + geom_point(alpha = 0.5) + scale_x_log10() + geom_smooth(method="lm") +``` + +We can make the line thicker by _setting_ the **linewidth** aesthetic in the +`geom_smooth` layer: + +```{r lm-fit2, fig.alt="Scatter plot of life expectancy vs GDP per capita with a trend line summarising the relationship between variables. 
The blue trend line is slightly thicker than in the previous figure."} +ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) + + geom_point(alpha = 0.5) + scale_x_log10() + geom_smooth(method="lm", linewidth=1.5) +``` + +There are two ways an _aesthetic_ can be specified. Here we _set_ the **linewidth** aesthetic by passing it as an argument to `geom_smooth`, so it is applied uniformly to the whole `geom`. Previously in the lesson we've used the `aes` function to define a _mapping_ between data variables and their visual representation. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 4a + +Modify the color and size of the points on the point layer in the previous +example. + +Hint: do not use the `aes` function. + +Hint: the equivalent of `linewidth` for points is `size`. + +::::::::::::::: solution + +## Solution to challenge 4a + +Here is a possible solution: +Notice that the `color` argument is supplied outside of the `aes()` function. +This means that it applies to all data points on the graph and is not related to +a specific variable. + +```{r ch4a-sol, fig.alt="Scatter plot of life expectancy vs GDP per capita with a trend line summarising the relationship between variables. The plot illustrates the possibilities for styling visualisations in ggplot2 with data points enlarged, coloured orange, and displayed without transparency."} +ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) + + geom_point(size=3, color="orange") + scale_x_log10() + + geom_smooth(method="lm", linewidth=1.5) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 4b + +Modify your solution to Challenge 4a so that the +points are now a different shape and are colored by continent with new +trendlines. Hint: The color argument can be used inside the aesthetic. 
+ +::::::::::::::: solution + +## Solution to challenge 4b + +Here is a possible solution: +Notice that supplying the `color` argument inside the `aes()` functions enables you to +connect it to a certain variable. The `shape` argument, as you can see, modifies all +data points the same way (it is outside the `aes()` call) while the `color` argument which +is placed inside the `aes()` call modifies a point's color based on its continent value. + +```{r ch4b-sol} +ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp, color = continent)) + + geom_point(size=3, shape=17) + scale_x_log10() + + geom_smooth(method="lm", linewidth=1.5) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Multi-panel figures + +Earlier we visualized the change in life expectancy over time across all +countries in one plot. Alternatively, we can split this out over multiple panels +by adding a layer of **facet** panels. + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip + +We start by making a subset of data including only countries located +in the Americas. This includes 25 countries, which will begin to +clutter the figure. Note that we apply a "theme" definition to rotate +the x-axis labels to maintain readability. Nearly everything in +ggplot2 is customizable. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +```{r facet} +americas <- gapminder[gapminder$continent == "Americas",] +ggplot(data = americas, mapping = aes(x = year, y = lifeExp)) + + geom_line() + + facet_wrap( ~ country) + + theme(axis.text.x = element_text(angle = 45)) +``` + +The `facet_wrap` layer took a "formula" as its argument, denoted by the tilde +(~). This tells R to draw a panel for each unique value in the country column +of the gapminder dataset. + +## Modifying text + +To clean this figure up for a publication we need to change some of the text +elements. 
The x-axis is too cluttered, and the y axis should read +"Life expectancy", rather than the column name in the data frame. + +We can do this by adding a couple of different layers. The **theme** layer +controls the axis text, and overall text size. Labels for the axes, plot +title and any legend can be set using the `labs` function. Legend titles +are set using the same names we used in the `aes` specification. Thus below +the color legend title is set using `color = "Continent"`, while the title +of a fill legend would be set using `fill = "MyTitle"`. + +```{r theme} +ggplot(data = americas, mapping = aes(x = year, y = lifeExp, color=continent)) + + geom_line() + facet_wrap( ~ country) + + labs( + x = "Year", # x axis title + y = "Life expectancy", # y axis title + title = "Figure 1", # main title of figure + color = "Continent" # title of legend + ) + + theme(axis.text.x = element_text(angle = 90, hjust = 1)) +``` + +## Exporting the plot + +The `ggsave()` function allows you to export a plot created with ggplot. You can specify the dimension and resolution of your plot by adjusting the appropriate arguments (`width`, `height` and `dpi`) to create high quality graphics for publication. In order to save the plot from above, we first assign it to a variable `lifeExp_plot`, then tell `ggsave` to save that plot in `png` format to a directory called `results`. (Make sure you have a `results/` folder in your working directory.) 
+ +```{r directory-check, echo=FALSE} +if (!dir.exists("results")) { + dir.create("results") +} +``` + +```{r save} +lifeExp_plot <- ggplot(data = americas, mapping = aes(x = year, y = lifeExp, color=continent)) + + geom_line() + facet_wrap( ~ country) + + labs( + x = "Year", # x axis title + y = "Life expectancy", # y axis title + title = "Figure 1", # main title of figure + color = "Continent" # title of legend + ) + + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + +ggsave(filename = "results/lifeExp.png", plot = lifeExp_plot, width = 12, height = 10, dpi = 300, units = "cm") +``` + +There are two nice things about `ggsave`. First, it defaults to the last plot, so if you omit the `plot` argument it will automatically save the last plot you created with `ggplot`. Secondly, it tries to determine the format you want to save your plot in from the file extension you provide for the filename (for example `.png` or `.pdf`). If you need to, you can specify the format explicitly in the `device` argument. + +This is a taste of what you can do with ggplot2. RStudio provides a +really useful [cheat sheet][cheat] of the different layers available, and more +extensive documentation is available on the [ggplot2 website][ggplot-doc]. All RStudio cheat sheets are available from the [RStudio website][cheat_all]. +Finally, if you have no idea how to change something, a quick Google search will +usually send you to a relevant question and answer on Stack Overflow with reusable +code to modify! + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 5 + +Generate boxplots to compare life expectancy between the different continents during the available years. + +Advanced: + +- Rename y axis as Life Expectancy. +- Remove x axis labels. 
+ +::::::::::::::: solution + +## Solution to Challenge 5 + +Here is a possible solution: +`xlab()` and `ylab()` set labels for the x and y axes, respectively. +The axis title, text and ticks are attributes of the theme and must be modified within a `theme()` call. + +```{r ch5-sol} +ggplot(data = gapminder, mapping = aes(x = continent, y = lifeExp, fill = continent)) + + geom_boxplot() + facet_wrap(~year) + + ylab("Life Expectancy") + + theme(axis.title.x=element_blank(), + axis.text.x = element_blank(), + axis.ticks.x = element_blank()) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +[base]: https://www.statmethods.net/graphs/index.html +[lattice]: https://www.statmethods.net/advgraphs/trellis.html +[ggplot2]: https://www.statmethods.net/advgraphs/ggplot2.html +[cheat]: https://www.rstudio.org/links/data_visualization_cheat_sheet +[cheat_all]: https://www.rstudio.com/resources/cheatsheets/ +[ggplot-doc]: https://ggplot2.tidyverse.org/reference/ + +:::::::::::::::::::::::::::::::::::::::: keypoints + +- Use `ggplot2` to create plots. +- Think about graphics in layers: aesthetics, geometry, statistics, scale transformation, and grouping. + +:::::::::::::::::::::::::::::::::::::::::::::::::: diff --git a/locale/uk/episodes/09-vectorization.Rmd b/locale/uk/episodes/09-vectorization.Rmd new file mode 100644 index 000000000..9cae732ed --- /dev/null +++ b/locale/uk/episodes/09-vectorization.Rmd @@ -0,0 +1,332 @@ +--- +title: Vectorization +teaching: 10 +exercises: 15 +source: Rmd +--- + +::::::::::::::::::::::::::::::::::::::: objectives + +- To understand vectorized operations in R. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- How can I operate on all the elements of a vector at once? 
+ +:::::::::::::::::::::::::::::::::::::::::::::::::: + +```{r, include=FALSE} +gapminder <- read.csv("data/gapminder_data.csv", header = TRUE) +library("ggplot2") +``` + +Most of R's functions are vectorized, meaning that the function will +operate on all elements of a vector without needing to loop through +and act on each element one at a time. This makes writing code more +concise, easy to read, and less error prone. + +```{r} +x <- 1:4 +x * 2 +``` + +The multiplication happened to each element of the vector. + +We can also add two vectors together: + +```{r} +y <- 6:9 +x + y +``` + +Each element of `x` was added to its corresponding element of `y`: + +```{r, eval=FALSE} +x: 1 2 3 4 + + + + + +y: 6 7 8 9 +--------------- + 7 9 11 13 +``` + +Here is how we would add two vectors together using a for loop: + +```{r} +output_vector <- c() +for (i in 1:4) { + output_vector[i] <- x[i] + y[i] +} +output_vector + + +``` + +Compare this to the output using vectorised operations. + +```{r} +sum_xy <- x + y +sum_xy +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 1 + +Let's try this on the `pop` column of the `gapminder` dataset. + +Make a new column in the `gapminder` data frame that +contains population in units of millions of people. +Check the head or tail of the data frame to make sure +it worked. + +::::::::::::::: solution + +## Solution to challenge 1 + +Let's try this on the `pop` column of the `gapminder` dataset. + +Make a new column in the `gapminder` data frame that +contains population in units of millions of people. +Check the head or tail of the data frame to make sure +it worked. + +```{r} +gapminder$pop_millions <- gapminder$pop / 1e6 +head(gapminder) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 2 + +On a single graph, plot population, in +millions, against year, for all countries. 
Do not worry about +identifying which country is which. + +Repeat the exercise, graphing only for China, India, and +Indonesia. Again, do not worry about which is which. + +::::::::::::::: solution + +## Solution to challenge 2 + +Refresh your plotting skills by plotting population in millions against year. + +```{r ch2-sol, fig.alt="Scatter plot showing populations in the millions against the year for China, India, and Indonesia, countries are not labeled."} +ggplot(gapminder, aes(x = year, y = pop_millions)) + + geom_point() +countryset <- c("China","India","Indonesia") +ggplot(gapminder[gapminder$country %in% countryset,], + aes(x = year, y = pop_millions)) + + geom_point() +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +Comparison operators, logical operators, and many functions are also +vectorized: + +**Comparison operators** + +```{r} +x > 2 +``` + +**Logical operators** + +```{r} +a <- x > 3 # or, for clarity, a <- (x > 3) +a +``` + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: some useful functions for logical vectors + +`any()` will return `TRUE` if _any_ element of a vector is `TRUE`.\ +`all()` will return `TRUE` if _all_ elements of a vector are `TRUE`. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +Most functions also operate element-wise on vectors: + +**Functions** + +```{r} +x <- 1:4 +log(x) +``` + +Vectorized operations work element-wise on matrices: + +```{r} +m <- matrix(1:12, nrow=3, ncol=4) +m * -1 +``` + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: element-wise vs. matrix multiplication + +Very important: the operator `*` gives you element-wise multiplication! 
+To do matrix multiplication, we need to use the `%*%` operator: + +```{r} +m %*% matrix(1, nrow=4, ncol=1) +matrix(1:4, nrow=1) %*% matrix(1:4, ncol=1) +``` + +For more on matrix algebra, see the Quick-R reference +guide + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 3 + +Given the following matrix: + +```{r} +m <- matrix(1:12, nrow=3, ncol=4) +m +``` + +Write down what you think will happen when you run: + +1. `m ^ -1` +2. `m * c(1, 0, -1)` +3. `m > c(0, 20)` +4. `m * c(1, 0, -1, 2)` + +Did you get the output you expected? If not, ask a helper! + +::::::::::::::: solution + +## Solution to challenge 3 + +Given the following matrix: + +```{r} +m <- matrix(1:12, nrow=3, ncol=4) +m +``` + +Write down what you think will happen when you run: + +1. `m ^ -1` + +```{r, echo=FALSE} +m ^ -1 +``` + +2. `m * c(1, 0, -1)` + +```{r, echo=FALSE} +m * c(1, 0, -1) +``` + +3. `m > c(0, 20)` + +```{r, echo=FALSE} +m > c(0, 20) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 4 + +We're interested in looking at the sum of the +following sequence of fractions: + +```{r, eval=FALSE} + x = 1/(1^2) + 1/(2^2) + 1/(3^2) + ... + 1/(n^2) +``` + +This would be tedious to type out, and impossible for high values of +n. Use vectorisation to compute x when n=100. What is the sum when +n=10,000? + +::::::::::::::: solution + +## Challenge 4 + +We're interested in looking at the sum of the +following sequence of fractions: + +```{r, eval=FALSE} + x = 1/(1^2) + 1/(2^2) + 1/(3^2) + ... + 1/(n^2) +``` + +This would be tedious to type out, and impossible for +high values of n. +Can you use vectorisation to compute x, when n=100? +How about when n=10,000? 
+ +```{r} +sum(1/(1:100)^2) +sum(1/(1:1e04)^2) +n <- 10000 +sum(1/(1:n)^2) +``` + +We can also obtain the same results using a function: + +```{r} +inverse_sum_of_squares <- function(n) { + sum(1/(1:n)^2) +} +inverse_sum_of_squares(100) +inverse_sum_of_squares(10000) +n <- 10000 +inverse_sum_of_squares(n) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: Operations on vectors of unequal length + +Operations can also be performed on vectors of unequal length, through +a process known as _recycling_. This process automatically repeats the shorter vector +until it matches the length of the longer vector. R will provide a warning +if the length of the longer vector is not a multiple of the length of the shorter vector. + +```{r} +x <- c(1, 2, 3) +y <- c(1, 2, 3, 4, 5, 6, 7) +x + y +``` + +Vector `x` was recycled to match the length of vector `y`: + +```{r, eval=FALSE} +x: 1 2 3 1 2 3 1 + + + + + + + + +y: 1 2 3 4 5 6 7 +----------------------- + 2 4 6 5 7 9 8 +``` + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: keypoints + +- Use vectorized operations instead of loops. + +:::::::::::::::::::::::::::::::::::::::::::::::::: diff --git a/locale/uk/episodes/10-functions.Rmd b/locale/uk/episodes/10-functions.Rmd new file mode 100644 index 000000000..ba405661f --- /dev/null +++ b/locale/uk/episodes/10-functions.Rmd @@ -0,0 +1,590 @@ +--- +title: Functions Explained +teaching: 45 +exercises: 15 +source: Rmd +--- + +::::::::::::::::::::::::::::::::::::::: objectives + +- Define a function that takes arguments. +- Return a value from a function. +- Check argument conditions with `stopifnot()` in functions. +- Test a function. +- Set default values for function arguments. +- Explain why we should divide programs into small, single-purpose functions.
+ +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- How can I write a new function in R? + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +```{r, include=FALSE} +gapminder <- read.csv("data/gapminder_data.csv", header = TRUE) +``` + +If we only had one data set to analyze, it would probably be faster to load the +file into a spreadsheet and use that to plot simple statistics. However, the +gapminder data is updated periodically, and we may want to pull in that new +information later and re-run our analysis again. We may also obtain similar data +from a different source in the future. + +In this lesson, we'll learn how to write a function so that we can repeat +several operations with a single command. + +::::::::::::::::::::::::::::::::::::::::: callout + +## What is a function? + +Functions gather a sequence of operations into a whole, preserving it for +ongoing use. Functions provide: + +- a name we can remember and invoke it by +- relief from the need to remember the individual operations +- a defined set of inputs and expected outputs +- rich connections to the larger programming environment + +As the basic building block of most programming languages, user-defined +functions constitute "programming" as much as any single abstraction can. If +you have written a function, you are a computer programmer. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Defining a function + +Let's open a new R script file in the `functions/` directory and call it +functions-lesson.R. 
+ +The general structure of a function is: + +```{r} +my_function <- function(parameters) { + # perform action + # return value +} +``` + +Let's define a function `fahr_to_kelvin()` that converts temperatures from +Fahrenheit to Kelvin: + +```{r} +fahr_to_kelvin <- function(temp) { + kelvin <- ((temp - 32) * (5 / 9)) + 273.15 + return(kelvin) +} +``` + +We define `fahr_to_kelvin()` by assigning it to the output of `function`. The +list of argument names is contained within parentheses. Next, the +[body](../learners/reference.md#body) of the function--the +statements that are executed when it runs--is contained within curly braces +(`{}`). The statements in the body are indented by two spaces. This makes the +code easier to read but does not affect how the code operates. + +It is useful to think of creating functions like writing a cookbook. First you define the "ingredients" that your function needs. In this case, we only need one ingredient to use our function: "temp". After we list our ingredients, we say what we will do with them. In this case, we take our ingredient and apply a set of mathematical operators to it. + +When we call the function, the values we pass to it as arguments are assigned to +those variables so that we can use them inside the function. Inside the +function, we use a return +statement to send a result back to +whoever asked for it. + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip + +One feature of R that may be unfamiliar is that the return statement is not required. +R automatically returns the value of the last expression evaluated in the body +of the function. But for clarity, we will explicitly define the +return statement. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +Let's try running our function.
+Calling our own function is no different from calling any other function: + +```{r} +# freezing point of water +fahr_to_kelvin(32) +``` + +```{r} +# boiling point of water +fahr_to_kelvin(212) +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 1 + +Write a function called `kelvin_to_celsius()` that takes a temperature in +Kelvin and returns that temperature in Celsius. + +Hint: To convert from Kelvin to Celsius you subtract 273.15 + +::::::::::::::: solution + +## Solution to challenge 1 + +Write a function called `kelvin_to_celsius` that takes a temperature in Kelvin +and returns that temperature in Celsius + +```{r} +kelvin_to_celsius <- function(temp) { + celsius <- temp - 273.15 + return(celsius) +} +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Combining functions + +The real power of functions comes from mixing, matching and combining them +into ever-larger chunks to get the effect we want. + +Let's define two functions that will convert temperature from Fahrenheit to +Kelvin, and Kelvin to Celsius: + +```{r} +fahr_to_kelvin <- function(temp) { + kelvin <- ((temp - 32) * (5 / 9)) + 273.15 + return(kelvin) +} + +kelvin_to_celsius <- function(temp) { + celsius <- temp - 273.15 + return(celsius) +} +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 2 + +Define the function to convert directly from Fahrenheit to Celsius, +by reusing the two functions above (or using your own functions if you +prefer). 
+ +::::::::::::::: solution + +## Solution to challenge 2 + +Define the function to convert directly from Fahrenheit to Celsius, +by reusing these two functions above + +```{r} +fahr_to_celsius <- function(temp) { + temp_k <- fahr_to_kelvin(temp) + result <- kelvin_to_celsius(temp_k) + return(result) +} +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Interlude: Defensive Programming + +Now that we've begun to appreciate how writing functions provides an efficient +way to make R code re-usable and modular, we should note that it is important +to ensure that functions only work in their intended use-cases. Checking +function parameters is related to the concept of _defensive programming_. +Defensive programming encourages us to frequently check conditions and throw an +error if something is wrong. These checks are referred to as assertion +statements because we want to assert some condition is `TRUE` before proceeding. +They make it easier to debug because they give us a better idea of where the +errors originate. + +### Checking conditions with `stopifnot()` + +Let's start by re-examining `fahr_to_kelvin()`, our function for converting +temperatures from Fahrenheit to Kelvin. It was defined like so: + +```{r} +fahr_to_kelvin <- function(temp) { + kelvin <- ((temp - 32) * (5 / 9)) + 273.15 + return(kelvin) +} +``` + +For this function to work as intended, the argument `temp` must be a `numeric` +value; otherwise, the mathematical procedure for converting between the two +temperature scales will not work. To create an error, we can use the function +`stop()`. For example, since the argument `temp` must be a `numeric` vector, we +could check for this condition with an `if` statement and throw an error if the +condition was violated. 
We could augment our function above like so: + +```{r} +fahr_to_kelvin <- function(temp) { + if (!is.numeric(temp)) { + stop("temp must be a numeric vector.") + } + kelvin <- ((temp - 32) * (5 / 9)) + 273.15 + return(kelvin) +} +``` + +If we had multiple conditions or arguments to check, it would take many lines +of code to check all of them. Luckily R provides the convenience function +`stopifnot()`. We can list as many requirements that should evaluate to `TRUE`; +`stopifnot()` throws an error if it finds one that is `FALSE`. Listing these +conditions also serves a secondary purpose as extra documentation for the +function. + +Let's try out defensive programming with `stopifnot()` by adding assertions to +check the input to our function `fahr_to_kelvin()`. + +We want to assert the following: `temp` is a numeric vector. We may do that like +so: + +```{r} +fahr_to_kelvin <- function(temp) { + stopifnot(is.numeric(temp)) + kelvin <- ((temp - 32) * (5 / 9)) + 273.15 + return(kelvin) +} +``` + +It still works when given proper input. + +```{r} +# freezing point of water +fahr_to_kelvin(temp = 32) +``` + +But fails instantly if given improper input. + +```{r} +# Metric is a factor instead of numeric +fahr_to_kelvin(temp = as.factor(32)) +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 3 + +Use defensive programming to ensure that our `fahr_to_celsius()` function +throws an error immediately if the argument `temp` is specified +inappropriately. + +::::::::::::::: solution + +## Solution to challenge 3 + +Extend our previous definition of the function by adding in an explicit call +to `stopifnot()`. Since `fahr_to_celsius()` is a composition of two other +functions, checking inside here makes adding checks to the two component +functions redundant. 
+ +```{r} +fahr_to_celsius <- function(temp) { + stopifnot(is.numeric(temp)) + temp_k <- fahr_to_kelvin(temp) + result <- kelvin_to_celsius(temp_k) + return(result) +} +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## More on combining functions + +Now, we're going to define a function that calculates the Gross Domestic Product +of a nation from the data available in our dataset: + +```{r} +# Takes a dataset and multiplies the population column +# with the GDP per capita column. +calcGDP <- function(dat) { + gdp <- dat$pop * dat$gdpPercap + return(gdp) +} +``` + +We define `calcGDP()` by assigning it to the output of `function`. The list of +argument names are contained within parentheses. Next, the body of the function +\-- the statements executed when you call the function -- is contained within +curly braces (`{}`). + +We've indented the statements in the body by two spaces. This makes the code +easier to read but does not affect how it operates. + +When we call the function, the values we pass to it are assigned to the +arguments, which become variables inside the body of the function. + +Inside the function, we use the `return()` function to send back the result. +This `return()` function is optional: R will automatically return the results of +whatever command is executed on the last line of the function. + +```{r} +calcGDP(head(gapminder)) +``` + +That's not very informative. Let's add some more arguments so we can extract +that per year and country. + +```{r} +# Takes a dataset and multiplies the population column +# with the GDP per capita column. 
+calcGDP <- function(dat, year=NULL, country=NULL) { + if(!is.null(year)) { + dat <- dat[dat$year %in% year, ] + } + if (!is.null(country)) { + dat <- dat[dat$country %in% country,] + } + gdp <- dat$pop * dat$gdpPercap + + new <- cbind(dat, gdp=gdp) + return(new) +} +``` + +If you've been writing these functions down into a separate R script +(a good idea!), you can load in the functions into our R session by using the +`source()` function: + +```{r, eval=FALSE} +source("functions/functions-lesson.R") +``` + +Ok, so there's a lot going on in this function now. In plain English, the +function now subsets the provided data by year if the year argument isn't empty, +then subsets the result by country if the country argument isn't empty. Then it +calculates the GDP for whatever subset emerges from the previous two steps. The +function then adds the GDP as a new column to the subsetted data and returns +this as the final result. You can see that the output is much more informative +than a vector of numbers. + +Let's take a look at what happens when we specify the year: + +```{r} +head(calcGDP(gapminder, year=2007)) +``` + +Or for a specific country: + +```{r} +calcGDP(gapminder, country="Australia") +``` + +Or both: + +```{r} +calcGDP(gapminder, year=2007, country="Australia") +``` + +Let's walk through the body of the function: + +```{r, eval=FALSE} +calcGDP <- function(dat, year=NULL, country=NULL) { +``` + +Here we've added two arguments, `year`, and `country`. We've set +_default arguments_ for both as `NULL` using the `=` operator +in the function definition. This means that those arguments will +take on those values unless the user specifies otherwise. 
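A minimal sketch of how a `NULL` default behaves in isolation. The function `describe_mean()` and its `digits` argument are hypothetical and not part of the lesson's script; here `NULL` is taken to mean "skip this optional step":

```r
# Hypothetical helper: `digits = NULL` means "do not round"
describe_mean <- function(x, digits = NULL) {
  result <- mean(x)
  # Only round when the caller supplied a value for `digits`
  if (!is.null(digits)) {
    result <- round(result, digits)
  }
  return(result)
}

unrounded <- describe_mean(c(1, 2, 4))              # default used: no rounding
rounded   <- describe_mean(c(1, 2, 4), digits = 1)  # default overridden
unrounded
rounded
```

Using `NULL` as the default lets the function tell "the user gave nothing" apart from any real value the user might pass.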
+ +```{r, eval=FALSE} + if(!is.null(year)) { + dat <- dat[dat$year %in% year, ] + } + if (!is.null(country)) { + dat <- dat[dat$country %in% country,] + } +``` + +Here, we check whether each additional argument is set to `NULL`, and whenever +they're not `NULL` overwrite the dataset stored in `dat` with a subset given by +the non-`NULL` argument. + +Building these conditionals into the function makes it more flexible for later. +Now, we can use it to calculate the GDP for: + +- The whole dataset; +- A single year; +- A single country; +- A single combination of year and country. + +By using `%in%` instead of `==`, we can also give multiple years or countries to those +arguments. + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: Pass by value + +Functions in R almost always make copies of the data to operate on +inside of a function body. When we modify `dat` inside the function +we are modifying the copy of the gapminder dataset stored in `dat`, +not the original variable we gave as the first argument. + +This is called "pass-by-value" and it makes writing code much safer: +you can always be sure that whatever changes you make within the +body of the function stay inside the body of the function. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: Function scope + +Another important concept is scoping: any variables (or functions!) you +create or modify inside the body of a function only exist for the lifetime +of the function's execution. When we call `calcGDP()`, the variables `dat`, +`gdp` and `new` only exist inside the body of the function. Even if we +have variables of the same name in our interactive R session, they are +not modified in any way when executing a function.
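
As a small sketch of this behaviour (the variable `total` and the function `add_one()` are hypothetical names used only for illustration), a variable in the session keeps its value even when a function modifies an argument of the same name:

```r
# `total` exists in the calling environment
total <- 10

add_one <- function(total) {
  # This changes the function's local copy only
  total <- total + 1
  return(total)
}

result <- add_one(total)
result  # the value returned by the function
total   # the variable in the session, untouched by the call
```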
+ +:::::::::::::::::::::::::::::::::::::::::::::::::: + +```{r, eval=FALSE} + gdp <- dat$pop * dat$gdpPercap + new <- cbind(dat, gdp=gdp) + return(new) +} +``` + +Finally, we calculated the GDP on our new subset, and created a new data frame +with that column added. This means when we call the function later we can see +the context for the returned GDP values, which is much better than in our first +attempt where we got a vector of numbers. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 4 + +Test out your GDP function by calculating the GDP for New Zealand in 1987. How +does this differ from New Zealand's GDP in 1952? + +::::::::::::::: solution + +## Solution to challenge 4 + +```{r, eval=FALSE} + calcGDP(gapminder, year = c(1952, 1987), country = "New Zealand") +``` + +GDP for New Zealand in 1987: 65050008703 + +GDP for New Zealand in 1952: 21058193787 + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 5 + +The `paste()` function can be used to combine text together, e.g: + +```{r} +best_practice <- c("Write", "programs", "for", "people", "not", "computers") +paste(best_practice, collapse=" ") +``` + +Write a function called `fence()` that takes two vectors as arguments, called +`text` and `wrapper`, and prints out the text wrapped with the `wrapper`: + +```{r, eval=FALSE} +fence(text=best_practice, wrapper="***") +``` + +_Note:_ the `paste()` function has an argument called `sep`, which specifies +the separator between text. The default is a space: " ". The default for +`paste0()` is no space "". 
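
A brief sketch of how `sep` differs from `collapse`, assuming nothing beyond base R (the vector `words` is a made-up example):

```r
words <- c("Write", "programs")

# `sep` goes between the separate arguments being pasted together
with_sep <- paste("***", "text", sep = "-")

# `collapse` joins the elements of a single vector into one string
collapsed <- paste(words, collapse = " ")

# paste0() behaves like paste() with sep = ""
no_space <- paste0("***", "text")

with_sep
collapsed
no_space
```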
+ +::::::::::::::: solution + +## Solution to challenge 5 + +Write a function called `fence()` that takes two vectors as arguments, +called `text` and `wrapper`, and prints out the text wrapped with the +`wrapper`: + +```{r} +fence <- function(text, wrapper){ + text <- c(wrapper, text, wrapper) + result <- paste(text, collapse = " ") + return(result) +} +best_practice <- c("Write", "programs", "for", "people", "not", "computers") +fence(text=best_practice, wrapper="***") +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip + +R has some unique aspects that can be exploited when performing more +complicated operations. We will not be writing anything that requires +knowledge of these more advanced concepts. In the future when you are +comfortable writing functions in R, you can learn more by reading the +[R Language Manual][man] or this [chapter] from +[Advanced R Programming][adv-r] by Hadley Wickham. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: Testing and documenting + +It's important to both test functions and document them: +documentation helps you, and others, understand what the +purpose of your function is and how to use it, and it's +important to make sure that your function actually does +what you think it does. + +When you first start out, your workflow will probably look a lot +like this: + +1. Write a function +2. Comment parts of the function to document its behaviour +3. Load in the source file +4. Experiment with it in the console to make sure it behaves + as you expect +5. Make any necessary bug fixes +6. Rinse and repeat. + +Formal documentation for functions, written in separate `.Rd` +files, gets turned into the documentation you see in help +files.
The [roxygen2] package allows R coders to write documentation +alongside the function code and then process it into the appropriate `.Rd` +files. You will want to switch to this more formal method of writing +documentation when you start writing more complicated R projects. In fact, +packages are, in essence, bundles of functions with this formal documentation. +Loading your own functions through `source("functions.R")` is equivalent to +loading someone else's functions (or your own one day!) through +`library("package")`. + +Formal automated tests can be written using the [testthat] package. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +[man]: https://cran.r-project.org/doc/manuals/r-release/R-lang.html#Environment-objects +[chapter]: https://adv-r.had.co.nz/Environments.html +[adv-r]: https://adv-r.had.co.nz/ +[roxygen2]: https://cran.r-project.org/web/packages/roxygen2/vignettes/rd.html +[testthat]: https://r-pkgs.had.co.nz/tests.html + +:::::::::::::::::::::::::::::::::::::::: keypoints + +- Use `function` to define a new function in R. +- Use parameters to pass values into functions. +- Use `stopifnot()` to flexibly check function arguments in R. +- Load functions into programs using `source()`. + +:::::::::::::::::::::::::::::::::::::::::::::::::: diff --git a/locale/uk/episodes/11-writing-data.Rmd b/locale/uk/episodes/11-writing-data.Rmd new file mode 100644 index 000000000..646e11b7e --- /dev/null +++ b/locale/uk/episodes/11-writing-data.Rmd @@ -0,0 +1,188 @@ +--- +title: Writing Data +teaching: 10 +exercises: 10 +source: Rmd +--- + +::::::::::::::::::::::::::::::::::::::: objectives + +- To be able to write out plots and data from R. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- How can I save plots and data created in R? 
+ +:::::::::::::::::::::::::::::::::::::::::::::::::: + +```{r, include=FALSE} +library("ggplot2") +gapminder <- read.csv("data/gapminder_data.csv", header = TRUE) +dir.create("cleaned-data") +``` + +## Saving plots + +You have already seen how to save the most recent plot you create in `ggplot2`, +using the command `ggsave`. As a refresher: + +```{r, eval=FALSE} +ggsave("My_most_recent_plot.pdf") +``` + +You can save a plot from within RStudio using the 'Export' button +in the 'Plot' window. This will give you the option of saving as a +.pdf or as .png, .jpg or other image formats. + +Sometimes you will want to save plots without creating them in the +'Plot' window first. Perhaps you want to make a pdf document with +multiple pages: each one a different plot, for example. Or perhaps +you're looping through multiple subsets of a file, plotting data from +each subset, and you want to save each plot, but obviously can't stop +the loop to click 'Export' for each one. + +In this case you can use a more flexible approach. The function +`pdf` creates a new pdf device. You can control the size and resolution +using the arguments to this function. + +```{r, eval=FALSE} +pdf("Life_Exp_vs_time.pdf", width=12, height=4) +ggplot(data=gapminder, aes(x=year, y=lifeExp, colour=country)) + + geom_line() + + theme(legend.position = "none") + +# You then have to make sure to turn off the pdf device! + +dev.off() +``` + +Open up this document and have a look. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 1 + +Rewrite your 'pdf' command to print a second +page in the pdf, showing a facet plot (hint: use `facet_grid`) +of the same data with one panel per continent. 
+ +::::::::::::::: solution + +## Solution to challenge 1 + +```{r, eval=FALSE} +pdf("Life_Exp_vs_time.pdf", width = 12, height = 4) +p <- ggplot(data = gapminder, aes(x = year, y = lifeExp, colour = country)) + + geom_line() + + theme(legend.position = "none") +p +p + facet_grid(~continent) +dev.off() +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +The commands `jpeg`, `png`, etc. are used similarly to produce +documents in different formats. + +## Writing data + +At some point, you'll also want to write out data from R. + +We can use the `write.table` function for this, which is +very similar to `read.table` from before. + +Let's create a data-cleaning script. For this analysis, we +only want to focus on the gapminder data for Australia: + +```{r} +aust_subset <- gapminder[gapminder$country == "Australia",] + +write.table(aust_subset, + file="cleaned-data/gapminder-aus.csv", + sep="," +) +``` + +Let's switch back to the shell to take a look at the data to make sure it looks +OK: + +```{r, engine="bash"} +head cleaned-data/gapminder-aus.csv +``` + +Hmm, that's not quite what we wanted. Where did all these +quotation marks come from? Also the row numbers are +meaningless. + +Let's look at the help file to work out how to change this +behaviour. + +```{r, eval=FALSE} +?write.table +``` + +By default R will wrap character vectors with quotation marks +when writing out to file. It will also write out the row and +column names. + +Let's fix this: + +```{r} +write.table( + gapminder[gapminder$country == "Australia",], + file="cleaned-data/gapminder-aus.csv", + sep=",", quote=FALSE, row.names=FALSE +) +``` + +Now let's look at the data again using our shell skills: + +```{r, engine="bash"} +head cleaned-data/gapminder-aus.csv +``` + +That looks better!
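
The effect of these two arguments can also be checked from within R. The following is a minimal sketch that uses a hypothetical two-row data frame `toy` and a temporary file instead of the lesson's `cleaned-data/` directory:

```r
# Hypothetical toy data frame, standing in for a gapminder subset
toy <- data.frame(
  country = c("Australia", "Australia"),
  year    = c(2002, 2007),
  stringsAsFactors = FALSE
)

# Write to a temporary file so nothing in the project is overwritten
out_file <- tempfile(fileext = ".csv")
write.table(toy, file = out_file, sep = ",", quote = FALSE, row.names = FALSE)

# Inspect the raw lines, then read the file back in
raw_lines <- readLines(out_file)
raw_lines
toy_back <- read.csv(out_file, stringsAsFactors = FALSE)
toy_back
```

One caveat of `quote=FALSE`: if a value itself contains a comma, the columns can no longer be separated reliably when the file is read back, so quoting is worth keeping for free-text fields.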
+ +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 2 + +Write a data-cleaning script file that subsets the gapminder +data to include only data points collected since 1990. + +Use this script to write out the new subset to a file +in the `cleaned-data/` directory. + +::::::::::::::: solution + +## Solution to challenge 2 + +```{r, eval=FALSE} +write.table( + gapminder[gapminder$year > 1990, ], + file = "cleaned-data/gapminder-after1990.csv", + sep = ",", quote = FALSE, row.names = FALSE +) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +```{r, echo=FALSE} +# We remove after rendering the lesson, because we don't want this in the lesson +# repository +unlink("cleaned-data", recursive=TRUE) +``` + +:::::::::::::::::::::::::::::::::::::::: keypoints + +- Save plots from RStudio using the 'Export' button. +- Use `write.table` to save tabular data. + +:::::::::::::::::::::::::::::::::::::::::::::::::: diff --git a/locale/uk/episodes/12-dplyr.Rmd b/locale/uk/episodes/12-dplyr.Rmd new file mode 100644 index 000000000..0f5540883 --- /dev/null +++ b/locale/uk/episodes/12-dplyr.Rmd @@ -0,0 +1,487 @@ +--- +title: Data Frame Manipulation with dplyr +teaching: 40 +exercises: 15 +source: Rmd +--- + +::::::::::::::::::::::::::::::::::::::: objectives + +- To be able to use the six main data frame manipulation 'verbs' with pipes in `dplyr`. +- To understand how `group_by()` and `summarize()` can be combined to summarize datasets. +- Be able to analyze a subset of data using logical filtering. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- How can I manipulate data frames without repeating myself? 
+ +:::::::::::::::::::::::::::::::::::::::::::::::::: + +```{r, include=FALSE} +gapminder <- read.csv("data/gapminder_data.csv", header = TRUE) +``` + +Manipulation of data frames means many things to many researchers: we often +select certain observations (rows) or variables (columns), we often group the +data by a certain variable(s), or we even calculate summary statistics. We can +do these operations using the normal base R operations: + +```{r} +mean(gapminder$gdpPercap[gapminder$continent == "Africa"]) +mean(gapminder$gdpPercap[gapminder$continent == "Americas"]) +mean(gapminder$gdpPercap[gapminder$continent == "Asia"]) +``` + +But this isn't very _nice_ because there is a fair bit of repetition. Repeating +yourself will cost you time, both now and later, and potentially introduce some +nasty bugs. + +## The `dplyr` package + +Luckily, the [`dplyr`](https://cran.r-project.org/package=dplyr) +package provides a number of very useful functions for manipulating data frames +in a way that will reduce the above repetition, reduce the probability of making +errors, and probably even save you some typing. As an added bonus, you might +even find the `dplyr` grammar easier to read. + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: Tidyverse + +`dplyr` package belongs to a broader family of opinionated R packages +designed for data science called the "Tidyverse". These +packages are specifically designed to work harmoniously together. +Some of these packages will be covered along this course, but you can find more +complete information here: [https://www.tidyverse.org/](https://www.tidyverse.org/). + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +Here we're going to cover 5 of the most commonly used functions as well as using +pipes (`%>%`) to combine them. + +1. `select()` +2. `filter()` +3. `group_by()` +4. `summarize()` +5. 
`mutate()` + +If you have not installed this package earlier, please do so: + +```{r, eval=FALSE} +install.packages('dplyr') +``` + +Now let's load the package: + +```{r, message=FALSE} +library("dplyr") +``` + +## Using select() + +If, for example, we wanted to move forward with only a few of the variables in +our data frame we could use the `select()` function. This will keep only the +variables you select. + +```{r} +year_country_gdp <- select(gapminder, year, country, gdpPercap) +``` + +![](fig/13-dplyr-fig1.png){alt='Diagram illustrating use of select function to select two columns of a data frame'} +If we want to remove only one column from the `gapminder` data, for example +the `continent` column, we can prefix its name with a minus sign: + +```{r} +smaller_gapminder_data <- select(gapminder, -continent) +``` + +If we open up `year_country_gdp` we'll see that it only contains the year, +country and gdpPercap. Above we used 'normal' grammar, but the strengths of +`dplyr` lie in combining several functions using pipes. Since the pipes grammar +is unlike anything we've seen in R before, let's repeat what we've done above +using pipes. + +```{r} +year_country_gdp <- gapminder %>% select(year, country, gdpPercap) +``` + +To help you understand why we wrote it that way, let's walk through it step +by step. First we summon the gapminder data frame and pass it on, using the pipe +symbol `%>%`, to the next step, which is the `select()` function. In this case +we don't specify which data object we use in the `select()` function since it +gets that from the previous pipe. **Fun Fact**: There is a good chance you have +encountered pipes before in the shell. In R, a pipe symbol is `%>%` while in the +shell it is `|` but the concept is the same! + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: Renaming data frame columns in dplyr + +In Chapter 4 we covered how you can rename columns with base R by assigning a value to the output of the `names()` function.
+Just like select, this is a bit cumbersome, but thankfully dplyr has a `rename()` function. + +Within a pipeline, the syntax is `rename(new_name = old_name)`. +For example, we may want to rename the gdpPercap column name from our `select()` statement above. + +```{r} +tidy_gdp <- year_country_gdp %>% rename(gdp_per_capita = gdpPercap) + +head(tidy_gdp) +``` + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Using filter() + +If we now want to move forward with the above, but only with European +countries, we can combine `select` and `filter` + +```{r} +year_country_gdp_euro <- gapminder %>% + filter(continent == "Europe") %>% + select(year, country, gdpPercap) +``` + +If we now want to show life expectancy of European countries but only +for a specific year (e.g., 2007), we can do as below. + +```{r} +europe_lifeExp_2007 <- gapminder %>% + filter(continent == "Europe", year == 2007) %>% + select(country, lifeExp) +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 1 + +Write a single command (which can span multiple lines and includes pipes) that +will produce a data frame that has the African values for `lifeExp`, `country` +and `year`, but not for other Continents. How many rows does your data frame +have and why? + +::::::::::::::: solution + +## Solution to Challenge 1 + +```{r} +year_country_lifeExp_Africa <- gapminder %>% + filter(continent == "Africa") %>% + select(year, country, lifeExp) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +As with last time, first we pass the gapminder data frame to the `filter()` +function, then we pass the filtered version of the gapminder data frame to the +`select()` function. **Note:** The order of operations is very important in this +case. If we used 'select' first, filter would not be able to find the variable +continent since we would have removed it in the previous step. 
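
A minimal sketch of why the order matters, using a hypothetical three-row data frame `toy` in place of gapminder. It assumes `dplyr` is installed and that no object named `continent` exists in the workspace:

```r
library("dplyr")

# Hypothetical toy data, standing in for gapminder
toy <- data.frame(
  continent = c("Europe", "Asia", "Europe"),
  country   = c("A", "B", "C"),
  lifeExp   = c(80, 75, 78),
  stringsAsFactors = FALSE
)

# filter() first: `continent` is still available when it is needed
right_order <- toy %>%
  filter(continent == "Europe") %>%
  select(country, lifeExp)
right_order

# select() first drops `continent`, so the later filter() raises an error;
# try() captures the error so the script can continue
wrong_order <- try(
  toy %>%
    select(country, lifeExp) %>%
    filter(continent == "Europe"),
  silent = TRUE
)
inherits(wrong_order, "try-error")
```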
+ +## Using group\_by() + +Now, we were supposed to be reducing the error-prone repetitiveness of what can +be done with base R, but up to now we haven't done that since we would have to +repeat the above for each continent. Instead of `filter()`, which will only pass +observations that meet your criteria (in the above: `continent=="Europe"`), we +can use `group_by()`, which will essentially use every unique criterion that you +could have used in filter. + +```{r} +str(gapminder) + +str(gapminder %>% group_by(continent)) +``` + +You will notice that the structure of the data frame where we used `group_by()` +(`grouped_df`) is not the same as the original `gapminder` (`data.frame`). A +`grouped_df` can be thought of as a `list` where each item in the `list` is a +`data.frame` which contains only the rows that correspond to a particular +value of `continent` (at least in the example above). + +![](fig/13-dplyr-fig2.png){alt='Diagram illustrating how the group by function organizes a data frame into groups'} + +## Using summarize() + +The above was a bit on the uneventful side but `group_by()` is much more +exciting in conjunction with `summarize()`. This will allow us to create new +variable(s) by using functions that repeat for each of the continent-specific +data frames. That is to say, using the `group_by()` function, we split our +original data frame into multiple pieces, then we can run functions +(e.g. `mean()` or `sd()`) within `summarize()`. + +```{r} +gdp_bycontinents <- gapminder %>% + group_by(continent) %>% + summarize(mean_gdpPercap = mean(gdpPercap)) +``` + +![](fig/13-dplyr-fig3.png){alt='Diagram illustrating the use of group by and summarize together to create a new variable'} + +```{r, eval=FALSE} +continent mean_gdpPercap + +1 Africa 2193.755 +2 Americas 7136.110 +3 Asia 7902.150 +4 Europe 14469.476 +5 Oceania 18621.609 +``` + +That allowed us to calculate the mean gdpPercap for each continent, but it gets +even better.
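
To see that `group_by()` followed by `summarize()` gives the same numbers as the repetitive base R approach, here is a small sketch on a hypothetical three-row data frame `toy` (not the gapminder data), assuming `dplyr` is installed:

```r
library("dplyr")

# Hypothetical toy data with two groups
toy <- data.frame(
  continent = c("Africa", "Africa", "Asia"),
  gdpPercap = c(1000, 3000, 5000),
  stringsAsFactors = FALSE
)

# One row per group, with the mean computed within each group
toy_means <- toy %>%
  group_by(continent) %>%
  summarize(mean_gdpPercap = mean(gdpPercap))
toy_means

# The same value for one group, computed the base R way
africa_mean <- mean(toy$gdpPercap[toy$continent == "Africa"])
africa_mean
```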
+ +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 2 + +Calculate the average life expectancy per country. Which has the longest average life +expectancy and which has the shortest average life expectancy? + +::::::::::::::: solution + +## Solution to Challenge 2 + +```{r} +lifeExp_bycountry <- gapminder %>% + group_by(country) %>% + summarize(mean_lifeExp = mean(lifeExp)) +lifeExp_bycountry %>% + filter(mean_lifeExp == min(mean_lifeExp) | mean_lifeExp == max(mean_lifeExp)) +``` + +Another way to do this is to use the `dplyr` function `arrange()`, which +arranges the rows in a data frame according to the order of one or more +variables from the data frame. It has similar syntax to other functions from +the `dplyr` package. You can use `desc()` inside `arrange()` to sort in +descending order. + +```{r} +lifeExp_bycountry %>% + arrange(mean_lifeExp) %>% + head(1) +lifeExp_bycountry %>% + arrange(desc(mean_lifeExp)) %>% + head(1) +``` + +Alphabetical order works too + +```{r} +lifeExp_bycountry %>% + arrange(desc(country)) %>% + head(1) +``` + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::: + +The function `group_by()` allows us to group by multiple variables. Let's group by `year` and `continent`. + +```{r} +gdp_bycontinents_byyear <- gapminder %>% + group_by(continent, year) %>% + summarize(mean_gdpPercap = mean(gdpPercap)) +``` + +That is already quite powerful, but it gets even better! You're not limited to defining 1 new variable in `summarize()`. + +```{r} +gdp_pop_bycontinents_byyear <- gapminder %>% + group_by(continent, year) %>% + summarize(mean_gdpPercap = mean(gdpPercap), + sd_gdpPercap = sd(gdpPercap), + mean_pop = mean(pop), + sd_pop = sd(pop)) +``` + +## count() and n() + +A very common operation is to count the number of observations for each +group. The `dplyr` package comes with two related functions that help with this. 
+ +For instance, if we want to check the number of countries included in the +dataset for the year 2002, we can use the `count()` function. It takes the name +of one or more columns that contain the groups we are interested in, and we can +optionally sort the results in descending order by adding `sort=TRUE`: + +```{r} +gapminder %>% + filter(year == 2002) %>% + count(continent, sort = TRUE) +``` + +If we need to use the number of observations in calculations, the `n()` function +is useful. It will return the total number of observations in the current group rather than counting the number of observations in each group within a specific column. For instance, if we wanted to get the standard error of the life expectancy per continent: + +```{r} +gapminder %>% + group_by(continent) %>% + summarize(se_le = sd(lifeExp)/sqrt(n())) +``` + +You can also chain together several summary operations; in this case calculating the `minimum`, `maximum`, `mean` and `se` of each continent's per-country life expectancy: + +```{r} +gapminder %>% + group_by(continent) %>% + summarize( + mean_le = mean(lifeExp), + min_le = min(lifeExp), + max_le = max(lifeExp), + se_le = sd(lifeExp)/sqrt(n())) +``` + +## Using mutate() + +We can also create new variables prior to (or even after) summarizing information using `mutate()`. + +```{r} +gdp_pop_bycontinents_byyear <- gapminder %>% + mutate(gdp_billion = gdpPercap*pop/10^9) %>% + group_by(continent,year) %>% + summarize(mean_gdpPercap = mean(gdpPercap), + sd_gdpPercap = sd(gdpPercap), + mean_pop = mean(pop), + sd_pop = sd(pop), + mean_gdp_billion = mean(gdp_billion), + sd_gdp_billion = sd(gdp_billion)) +``` + +## Connect mutate with logical filtering: ifelse + +When creating new variables, we can combine this with a logical condition. A simple combination of +`mutate()` and `ifelse()` facilitates filtering right where it is needed: at the moment of creating something new. 
+This easy-to-read statement is a fast and powerful way of discarding certain data (even though the overall dimension +of the data frame will not change) or of updating values depending on the given condition. + +```{r} +## keeping all data but "filtering" after a certain condition +# calculate GDP only for countries with a life expectancy above 25 +gdp_pop_bycontinents_byyear_above25 <- gapminder %>% + mutate(gdp_billion = ifelse(lifeExp > 25, gdpPercap * pop / 10^9, NA)) %>% + group_by(continent, year) %>% + summarize(mean_gdpPercap = mean(gdpPercap), + sd_gdpPercap = sd(gdpPercap), + mean_pop = mean(pop), + sd_pop = sd(pop), + mean_gdp_billion = mean(gdp_billion), + sd_gdp_billion = sd(gdp_billion)) + +## updating only if a certain condition is fulfilled +# for life expectancies above 40 years, the GDP to be expected in the future is scaled +gdp_future_bycontinents_byyear_high_lifeExp <- gapminder %>% + mutate(gdp_futureExpectation = ifelse(lifeExp > 40, gdpPercap * 1.5, gdpPercap)) %>% + group_by(continent, year) %>% + summarize(mean_gdpPercap = mean(gdpPercap), + mean_gdpPercap_expected = mean(gdp_futureExpectation)) +``` + +## Combining `dplyr` and `ggplot2` + +First install and load ggplot2: + +```{r, eval=FALSE} +install.packages('ggplot2') +``` + +```{r, message=FALSE} +library("ggplot2") +``` + +In the plotting lesson we looked at how to make a multi-panel figure by adding +a layer of facet panels using `ggplot2`. Here is the code we used (with some +extra comments): + +```{r} +# Filter countries located in the Americas +americas <- gapminder[gapminder$continent == "Americas", ] +# Make the plot +ggplot(data = americas, mapping = aes(x = year, y = lifeExp)) + + geom_line() + + facet_wrap( ~ country) + + theme(axis.text.x = element_text(angle = 45)) +``` + +This code makes the right plot but it also creates an intermediate variable +(`americas`) that we might not have any other uses for. 
Just as we used +`%>%` to pipe data along a chain of `dplyr` functions we can use it to pass data +to `ggplot()`. Because `%>%` replaces the first argument in a function we don't +need to specify the `data =` argument in the `ggplot()` function. By combining +`dplyr` and `ggplot2` functions we can make the same figure without creating any +new variables or modifying the data. + +```{r} +gapminder %>% + # Filter countries located in the Americas + filter(continent == "Americas") %>% + # Make the plot + ggplot(mapping = aes(x = year, y = lifeExp)) + + geom_line() + + facet_wrap( ~ country) + + theme(axis.text.x = element_text(angle = 45)) +``` + +More examples of using the function `mutate()` and the `ggplot2` package. + +```{r} +gapminder %>% + # extract first letter of country name into new column + mutate(startsWith = substr(country, 1, 1)) %>% + # only keep countries starting with A or Z + filter(startsWith %in% c("A", "Z")) %>% + # plot lifeExp into facets + ggplot(aes(x = year, y = lifeExp, colour = continent)) + + geom_line() + + facet_wrap(vars(country)) + + theme_minimal() +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Advanced Challenge + +Calculate the average life expectancy in 2002 of 2 randomly selected countries +for each continent. Then arrange the continent names in reverse order. +**Hint:** Use the `dplyr` functions `arrange()` and `sample_n()`, they have +similar syntax to other dplyr functions. 
+ +::::::::::::::: solution + +## Solution to Advanced Challenge + +```{r} +lifeExp_2countries_bycontinents <- gapminder %>% + filter(year==2002) %>% + group_by(continent) %>% + sample_n(2) %>% + summarize(mean_lifeExp=mean(lifeExp)) %>% + arrange(desc(continent)) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Other great resources + +- [R for Data Science](https://r4ds.hadley.nz/) (online book) +- [Data Wrangling Cheat sheet](https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf) (pdf file) +- [Introduction to dplyr](https://dplyr.tidyverse.org/) (online documentation) +- [Data wrangling with R and RStudio](https://www.rstudio.com/resources/webinars/data-wrangling-with-r-and-rstudio/) (online video) +- [Tidyverse Skills for Data Science](https://jhudatascience.org/tidyversecourse/) (online book) + +:::::::::::::::::::::::::::::::::::::::: keypoints + +- Use the `dplyr` package to manipulate data frames. +- Use `select()` to choose variables from a data frame. +- Use `filter()` to choose data based on values. +- Use `group_by()` and `summarize()` to work with subsets of data. +- Use `mutate()` to create new variables. + +:::::::::::::::::::::::::::::::::::::::::::::::::: diff --git a/locale/uk/episodes/13-tidyr.Rmd b/locale/uk/episodes/13-tidyr.Rmd new file mode 100644 index 000000000..96e59d18d --- /dev/null +++ b/locale/uk/episodes/13-tidyr.Rmd @@ -0,0 +1,321 @@ +--- +title: Data Frame Manipulation with tidyr +teaching: 30 +exercises: 15 +source: Rmd +--- + +::::::::::::::::::::::::::::::::::::::: objectives + +- To understand the concepts of 'longer' and 'wider' data frame formats and be able to convert between them with `tidyr`. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- How can I change the layout of a data frame? 
+ +:::::::::::::::::::::::::::::::::::::::::::::::::: + +```{r, include=FALSE} +gapminder <- read.csv("data/gapminder_data.csv", header = TRUE, stringsAsFactors = FALSE) +gap_wide <- read.csv("data/gapminder_wide.csv", header = TRUE, stringsAsFactors = FALSE) +``` + +Researchers often want to reshape their data frames from 'wide' to 'longer' +layouts, or vice-versa. The 'long' layout or format is where: + +- each column is a variable +- each row is an observation + +In the purely 'long' (or 'longest') format, you usually have 1 column for the observed variable and the other columns are ID variables. + +For the 'wide' format each row is often a site/subject/patient and you have +multiple observation variables containing the same type of data. These can be +either repeated observations over time, or observation of multiple variables (or +a mix of both). You may find data input may be simpler or some other +applications may prefer the 'wide' format. However, many of `R`'s functions have +been designed assuming you have 'longer' formatted data. This tutorial will help you +efficiently transform your data shape regardless of original format. + +![](fig/14-tidyr-fig1.png){alt='Diagram illustrating the difference between a wide versus long layout of a data frame'} + +Long and wide data frame layouts mainly affect readability. For humans, the wide format is often more intuitive since we can often see more of the data on the screen due +to its shape. However, the long format is more machine readable and is closer +to the formatting of databases. The ID variables in our data frames are similar to +the fields in a database and observed variables are like the database values. 
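As a concrete illustration of the two layouts, here is the same made-up information stored both ways. The data frames `wide` and `long` and all of their values are hypothetical; they are not part of the gapminder files.

```r
# Wide layout: one row per country, one column per year of observation
wide <- data.frame(
  country  = c("A", "B"),
  pop_2002 = c(10, 20),
  pop_2007 = c(12, 23),
  stringsAsFactors = FALSE
)

# Long layout: one row per observation, with `year` as an ID variable
long <- data.frame(
  country = c("A", "A", "B", "B"),
  year    = c(2002, 2007, 2002, 2007),
  pop     = c(10, 12, 20, 23),
  stringsAsFactors = FALSE
)

dim(wide)  # fewer rows, one column per year
dim(long)  # more rows, one column holding all population values
```

Both objects carry the same four population values; only the arrangement of rows and columns differs.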
+ +## Getting started + +First install the packages if you haven't already done so (you probably +installed dplyr in the previous lesson): + +```{r, eval=FALSE} +#install.packages("tidyr") +#install.packages("dplyr") +``` + +Load the packages: + +```{r, message=FALSE} +library("tidyr") +library("dplyr") +``` + +First, let's look at the structure of our original gapminder data frame: + +```{r} +str(gapminder) +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 1 + +Is gapminder a purely long, purely wide, or some intermediate format? + +::::::::::::::: solution + +## Solution to Challenge 1 + +The original gapminder data.frame is in an intermediate format. It is not +purely long since it has multiple observation variables +(`pop`,`lifeExp`,`gdpPercap`). + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +Sometimes, as with the gapminder dataset, we have multiple types of observed +data. It is somewhere in between the purely 'long' and 'wide' data formats. We +have 3 "ID variables" (`continent`, `country`, `year`) and 3 "Observation +variables" (`pop`,`lifeExp`,`gdpPercap`). This intermediate format can be +preferred despite not having ALL observations in 1 column given that all 3 +observation variables have different units. There are few operations that would +need us to make this data frame any longer (i.e. 4 ID variables and 1 +Observation variable). + +While using many of the functions in R, which are often vector based, you +usually do not want to do mathematical operations on values with different +units. For example, using the purely long format, a single mean for all of the +values of population, life expectancy, and GDP would not be meaningful since it +would return the mean of values with 3 incompatible units. The solution is that +we first manipulate the data either by grouping (see the lesson on `dplyr`), or +we change the structure of the data frame. 
**Note:** Some plotting functions in +R actually work better in the wide format data. + +## From wide to long format with pivot\_longer() + +Until now, we've been using the nicely formatted original gapminder dataset, but +'real' data (i.e. our own research data) will never be so well organized. Here +let's start with the wide formatted version of the gapminder dataset. + +> Download the wide version of the gapminder data from [this link to a csv file](data/gapminder_wide.csv) +> and save it in your data folder. + +We'll load the data file and look at it. Note: we don't want our continent and +country columns to be factors, so we use the stringsAsFactors argument for +`read.csv()` to disable that. + +```{r} +gap_wide <- read.csv("data/gapminder_wide.csv", stringsAsFactors = FALSE) +str(gap_wide) +``` + +![](fig/14-tidyr-fig2.png){alt='Diagram illustrating the wide format of the gapminder data frame'} + +To change this very wide data frame layout back to our nice, intermediate (or longer) layout, we will use one of the two available `pivot` functions from the `tidyr` package. To convert from wide to a longer format, we will use the `pivot_longer()` function. `pivot_longer()` makes datasets longer by increasing the number of rows and decreasing the number of columns, or 'lengthening' your observation variables into a single variable. + +![](fig/14-tidyr-fig3.png){alt='Diagram illustrating how pivot longer reorganizes a data frame from a wide to long format'} + +```{r} +gap_long <- gap_wide %>% + pivot_longer( + cols = c(starts_with('pop'), starts_with('lifeExp'), starts_with('gdpPercap')), + names_to = "obstype_year", values_to = "obs_values" + ) +str(gap_long) +``` + +Here we have used piping syntax which is similar to what we were doing in the +previous lesson with dplyr. In fact, these are compatible and you can use a mix +of tidyr and dplyr functions by piping them together. 
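For instance, a `pivot_longer()` step can feed directly into dplyr verbs within one pipeline. The sketch below uses a small invented data frame (`toy_wide`, with hypothetical values) rather than `gap_wide`, so it can be run on its own.

```r
library(tidyr)
library(dplyr)

# Hypothetical wide data: one life expectancy column per year
toy_wide <- data.frame(
  country      = c("A", "B"),
  lifeExp_2002 = c(70, 60),
  lifeExp_2007 = c(72, 64),
  stringsAsFactors = FALSE
)

toy_means <- toy_wide %>%
  # tidyr step: gather the yearly columns into name/value pairs
  pivot_longer(cols = starts_with("lifeExp"),
               names_to = "obstype_year", values_to = "obs_values") %>%
  # dplyr steps: group by country and average over the years
  group_by(country) %>%
  summarize(mean_lifeExp = mean(obs_values))

toy_means
```

The column names `obstype_year` and `obs_values` mirror the ones used in this lesson; any other names would work equally well.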
+ +We first provide to `pivot_longer()` a vector of column names that will be +pivoted into longer format. We could type out all the observation variables, but +as in the `select()` function (see `dplyr` lesson), we can use the `starts_with()` +helper to select all variables that start with the desired character string. +`pivot_longer()` also allows the alternative syntax of using the `-` symbol to +identify which variables are not to be pivoted (i.e. ID variables). + +The next arguments to `pivot_longer()` are `names_to` for naming the column that +will contain the new ID variable (`obstype_year`) and `values_to` for naming the +new amalgamated observation variable (`obs_values`). We supply these new column +names as strings. + +![](fig/14-tidyr-fig4.png){alt='Diagram illustrating the long format of the gapminder data'} + +```{r} +gap_long <- gap_wide %>% + pivot_longer( + cols = c(-continent, -country), + names_to = "obstype_year", values_to = "obs_values" + ) +str(gap_long) +``` + +That may seem trivial with this particular data frame, but sometimes you have 1 +ID variable and 40 observation variables with irregular variable names. The +flexibility is a huge time saver! + +Now `obstype_year` actually contains 2 pieces of information, the observation +type (`pop`,`lifeExp`, or `gdpPercap`) and the `year`. We can use the +`separate()` function to split the character strings into multiple variables. + +```{r} +gap_long <- gap_long %>% separate(obstype_year, into = c('obs_type', 'year'), sep = "_") +gap_long$year <- as.integer(gap_long$year) +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 2 + +Using `gap_long`, calculate the mean life expectancy, population, and gdpPercap for each continent. 
+**Hint:** use the `group_by()` and `summarize()` functions we learned in the `dplyr` lesson + +::::::::::::::: solution + +## Solution to Challenge 2 + +```{r} +gap_long %>% group_by(continent, obs_type) %>% + summarize(means=mean(obs_values)) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## From long to intermediate format with pivot\_wider() + +It is always good to check work. So, let's use the second `pivot` function, `pivot_wider()`, to 'widen' our observation variables back out. `pivot_wider()` is the opposite of `pivot_longer()`, making a dataset wider by increasing the number of columns and decreasing the number of rows. We can use `pivot_wider()` to pivot or reshape our `gap_long` to the original intermediate format or the widest format. Let's start with the intermediate format. + +The `pivot_wider()` function takes `names_from` and `values_from` arguments. + +To `names_from` we supply the column name whose contents will be pivoted into new +output columns in the widened data frame. The corresponding values will be added +from the column named in the `values_from` argument. + +```{r} +gap_normal <- gap_long %>% + pivot_wider(names_from = obs_type, values_from = obs_values) +dim(gap_normal) +dim(gapminder) +names(gap_normal) +names(gapminder) +``` + +Now we've got an intermediate data frame `gap_normal` with the same dimensions as +the original `gapminder`, but the order of the variables is different. Let's fix +that before checking if they are `all.equal()`. + +```{r} +gap_normal <- gap_normal[, names(gapminder)] +all.equal(gap_normal, gapminder) +head(gap_normal) +head(gapminder) +``` + +We're almost there, the original was sorted by `country`, then +`year`. + +```{r} +gap_normal <- gap_normal %>% arrange(country, year) +all.equal(gap_normal, gapminder) +``` + +That's great! We've gone from the longest format back to the intermediate and we +didn't introduce any errors in our code. 
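The same round-trip check can be tried on a tiny example. This sketch uses an invented data frame (`toy`, with hypothetical values): it pivots to the longer layout, pivots back, restores the column and row order, and compares the result with the original. Converting back with `as.data.frame()` is a choice made here so that both objects have the same class before the comparison.

```r
library(tidyr)
library(dplyr)

# Hypothetical intermediate-format data (values are made up)
toy <- data.frame(
  country = c("A", "A", "B", "B"),
  year    = c(2002L, 2007L, 2002L, 2007L),
  pop     = c(10, 12, 20, 23),
  lifeExp = c(70, 72, 60, 64),
  stringsAsFactors = FALSE
)

# Lengthen: stack `pop` and `lifeExp` into one value column
toy_long <- toy %>%
  pivot_longer(cols = c(pop, lifeExp),
               names_to = "obs_type", values_to = "obs_values")

# Widen again, sort the rows, and drop the tibble class
toy_back <- toy_long %>%
  pivot_wider(names_from = obs_type, values_from = obs_values) %>%
  arrange(country, year) %>%
  as.data.frame()

# Restore the original column order before comparing
toy_back <- toy_back[, names(toy)]
all.equal(toy_back, toy)
```

If the comparison reports differences, checking `names()`, `dim()`, and `class()` of both objects is usually the quickest way to find out which step changed something.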
+ +Now let's convert the long all the way back to the wide. In the wide format, we +will keep country and continent as ID variables and pivot the observations +across the 3 metrics (`pop`,`lifeExp`,`gdpPercap`) and time (`year`). First we +need to create appropriate labels for all our new variables (time\*metric +combinations) and we also need to unify our ID variables to simplify the process +of defining `gap_wide`. + +```{r} +gap_temp <- gap_long %>% unite(var_ID, continent, country, sep = "_") +str(gap_temp) + +gap_temp <- gap_long %>% + unite(ID_var, continent, country, sep = "_") %>% + unite(var_names, obs_type, year, sep = "_") +str(gap_temp) +``` + +Using `unite()` we now have a single ID variable which is a combination of +`continent`,`country`,and we have defined variable names. We're now ready to +pipe in `pivot_wider()` + +```{r} +gap_wide_new <- gap_long %>% + unite(ID_var, continent, country, sep = "_") %>% + unite(var_names, obs_type, year, sep = "_") %>% + pivot_wider(names_from = var_names, values_from = obs_values) +str(gap_wide_new) +``` + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 3 + +Take this 1 step further and create a `gap_ludicrously_wide` format data by pivoting over countries, year and the 3 metrics? +**Hint** this new data frame should only have 5 rows. 
+ +::::::::::::::: solution + +## Solution to Challenge 3 + +```{r} +gap_ludicrously_wide <- gap_long %>% + unite(var_names, obs_type, year, country, sep = "_") %>% + pivot_wider(names_from = var_names, values_from = obs_values) +``` + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +Now we have a great 'wide' format data frame, but the `ID_var` could be more +usable, let's separate it into 2 variables with `separate()` + +```{r} +gap_wide_betterID <- separate(gap_wide_new, ID_var, c("continent", "country"), sep="_") +gap_wide_betterID <- gap_long %>% + unite(ID_var, continent, country, sep = "_") %>% + unite(var_names, obs_type, year, sep = "_") %>% + pivot_wider(names_from = var_names, values_from = obs_values) %>% + separate(ID_var, c("continent","country"), sep = "_") +str(gap_wide_betterID) + +all.equal(gap_wide, gap_wide_betterID) +``` + +There and back again! + +## Other great resources + +- [R for Data Science](https://r4ds.hadley.nz/) (online book) +- [Data Wrangling Cheat sheet](https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf) (pdf file) +- [Introduction to tidyr](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html) (online documentation) +- [Data wrangling with R and RStudio](https://www.rstudio.com/resources/webinars/data-wrangling-with-r-and-rstudio/) (online video) + +:::::::::::::::::::::::::::::::::::::::: keypoints + +- Use the `tidyr` package to change the layout of data frames. +- Use `pivot_longer()` to go from wide to longer layout. +- Use `pivot_wider()` to go from long to wider layout. 
+ +:::::::::::::::::::::::::::::::::::::::::::::::::: diff --git a/locale/uk/episodes/14-knitr-markdown.Rmd b/locale/uk/episodes/14-knitr-markdown.Rmd new file mode 100644 index 000000000..5829180aa --- /dev/null +++ b/locale/uk/episodes/14-knitr-markdown.Rmd @@ -0,0 +1,493 @@ +--- +title: Producing Reports With knitr +teaching: 60 +exercises: 15 +source: Rmd +--- + +::::::::::::::::::::::::::::::::::::::: objectives + +- Understand the value of writing reproducible reports +- Learn how to recognise and compile the basic components of an R Markdown file +- Become familiar with R code chunks, and understand their purpose, structure and options +- Demonstrate the use of inline chunks for weaving R outputs into text blocks, for example when discussing the results of some calculations +- Be aware of alternative output formats to which an R Markdown file can be exported + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- How can I integrate software and reports? + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +```{r chunk_options, include=FALSE} +``` + +## Data analysis reports + +Data analysts tend to write a lot of reports, describing their +analyses and results, for their collaborators or to document their +work for future reference. + +Many new users begin by first writing a single R script containing all of their +work, and then share the analysis by emailing the script and various graphs +as attachments. But this can be cumbersome, requiring a lengthy discussion to +explain which attachment was which result. + +Writing formal reports with Word or [LaTeX](https://www.latex-project.org/) +can simplify this process by incorporating both the analysis report and output graphs +into a single document. 
But tweaking formatting to make figures look correct +and fixing obnoxious page breaks can be tedious and lead to a lengthy "whack-a-mole" +game of fixing new mistakes resulting from a single formatting change. + +Creating a report as a web page (which is an html file) using R Markdown makes things easier. +The report can be one long stream, so tall figures that wouldn't ordinarily fit on +one page can be kept at full size and remain easy to read, since the reader can simply +keep scrolling. Additionally, the formatting of an R Markdown document is simple and easy to modify, allowing you to spend +more time on your analyses instead of writing reports. + +## Literate programming + +Ideally, such analysis reports are _reproducible_ documents: if an +error is discovered, or if some additional subjects are added to the +data, you can just re-compile the report and get the new or corrected +results rather than having to reconstruct figures, paste them into +a Word document, and hand-edit various detailed results. + +The key R package here is [`knitr`](https://yihui.name/knitr/). It allows you +to create a document that is a mixture of text and chunks of +code. When the document is processed by `knitr`, chunks of code will +be executed, and graphs or other results will be inserted into the final document. + +This sort of idea has been called "literate programming". + +`knitr` allows you to mix basically any type of text with code from different programming languages, but we recommend that you use `R Markdown`, which mixes Markdown +with R. [Markdown](https://www.markdownguide.org/) is a light-weight mark-up language for creating web +pages. + +## Creating an R Markdown file + +Within RStudio, click File → New File → R Markdown and +you'll get a dialog box like this: + +![](fig/New_R_Markdown.png){alt='Screenshot of the New R Markdown file dialogue box in RStudio'} + +You can stick with the default (HTML output), but give it a title. 
+ +## Basic components of R Markdown + +The initial chunk of text (header) contains instructions for R to specify what kind of document will be created, and the options chosen. You can use the header to give your document a title, author, date, and tell it what type of output you want +to produce. In this case, we're creating an html document. + +``` +--- +title: "Initial R Markdown document" +author: "Karl Broman" +date: "April 23, 2015" +output: html_document +--- +``` + +You can delete any of those fields if you don't want them +included. The double-quotes aren't strictly _necessary_ in this case. +They're mostly needed if you want to include a colon in the title. + +RStudio creates the document with some example text to get you +started. Note below that there are chunks like + +
+```{r}
+summary(cars)
+```
+
+ +These are chunks of R code that will be executed by `knitr` and replaced +by their results. More on this later. + +## Markdown + +Markdown is a system for writing web pages by marking up the text much +as you would in an email rather than writing html code. The marked-up +text gets _converted_ to html, replacing the marks with the proper +html code. + +For now, let's delete all of the stuff that's there and write a bit of +markdown. + +You make things **bold** using two asterisks, like this: `**bold**`, +and you make things _italics_ by using underscores, like this: +`_italics_`. + +You can make a bulleted list by writing a list with hyphens or +asterisks with a space between the list and other text, like this: + +``` +A list: + +* bold with double-asterisks +* italics with underscores +* code-type font with backticks +``` + +or like this: + +``` +A second list: + +- bold with double-asterisks +- italics with underscores +- code-type font with backticks +``` + +Each will appear as: + +- bold with double-asterisks +- italics with underscores +- code-type font with backticks + +You can use whatever method you prefer, but _be consistent_. This maintains the +readability of your code. + +You can make a numbered list by just using numbers. You can even use the +same number over and over if you want: + +``` +1. bold with double-asterisks +1. italics with underscores +1. code-type font with backticks +``` + +This will appear as: + +1. bold with double-asterisks +2. italics with underscores +3. code-type font with backticks + +You can make section headers of different sizes by initiating a line +with some number of `#` symbols: + +``` +# Title +## Main section +### Sub-section +#### Sub-sub section +``` + +You _compile_ the R Markdown document to an html webpage by clicking +the "Knit" button in the upper-left. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 1 + +Create a new R Markdown document. 
Delete all of the R code chunks +and write a bit of Markdown (some sections, some italicized +text, and an itemized list). + +Convert the document to a webpage. + +::::::::::::::: solution + +## Solution to Challenge 1 + +In RStudio, select File > New file > R Markdown... + +Delete the placeholder text and add the following: + +``` +# Introduction + +## Background on Data + +This report uses the *gapminder* dataset, which has columns that include: + +* country +* continent +* year +* lifeExp +* pop +* gdpPercap + +## Background on Methods + +``` + +Then click the 'Knit' button on the toolbar to generate an html document (webpage). + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## A bit more Markdown + +You can make a hyperlink like this: +`[Carpentries Home Page](https://carpentries.org/)`. + +You can include an image file like this: `![The Carpentries Logo](https://carpentries.org/assets/img/TheCarpentries.svg)` + +You can do subscripts (e.g., F~2~) with `F~2~` and superscripts (e.g., +F^2^) with `F^2^`. + +If you know how to write equations in +[LaTeX](https://www.latex-project.org/), you can use `$ $` and `$$ $$` to insert math equations, like +`$E = mc^2$` and + +``` +$$y = \mu + \sum_{i=1}^p \beta_i x_i + \epsilon$$ +``` + +You can review Markdown syntax by navigating to the +"Markdown Quick Reference" under the "Help" field in the +toolbar at the top of RStudio. + +## R code chunks + +The real power of Markdown comes from +mixing markdown with chunks of code. This is R Markdown. When +processed, the R code will be executed; if they produce figures, the +figures will be inserted in the final document. + +The main code chunks look like this: + +
+```{r load_data}
+gapminder <- read.csv("gapminder.csv")
+```
+
+ +That is, you place a chunk of R code between \`\`\`{r chunk\_name} +and \`\`\`. You should give each chunk +a unique name, as they will help you to fix errors and, if any graphs are +produced, the file names are based on the name of the code chunk that +produced them. You can create code chunks quickly in RStudio using the shortcuts Ctrl\+Alt\+I on Windows and Linux, or Cmd\+Option\+I on Mac. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 2 + +Add code chunks to: + +- Load the ggplot2 package +- Read the gapminder data +- Create a plot + +::::::::::::::: solution + +## Solution to Challenge 2 + +
+```{r load-ggplot2}
+library("ggplot2")
+```
+
+ +
+```{r read-gapminder-data}
+gapminder <- read.csv("gapminder.csv")
+```
+
+ +
+```{r make-plot}
+plot(lifeExp ~ year, data = gapminder)
+```
+
+ +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## How things get compiled + +When you press the "Knit" button, the R Markdown document is +processed by [`knitr`](https://yihui.name/knitr) and a plain Markdown +document is produced (as well as, potentially, a set of figure files): the R code is executed +and replaced by both the input and the output; if figures are +produced, links to those figures are included. + +The Markdown and figure documents are then processed by the tool +[`pandoc`](https://pandoc.org/), which converts the Markdown file into an +html file, with the figures embedded. + +```{r rmd_to_html_fig, fig.width=8, fig.height=3, fig.align="left", echo=FALSE} +par(mar=rep(0, 4), bty="n", cex=1.5) +plot(0, 0, type="n", xlab="", ylab="", xaxt="n", yaxt="n", + xlim=c(0, 100), ylim=c(0, 100)) +xw <- 10 +yh <- 35 +xm <- 12 +ym <- 50 +rect(xm-xw/2, ym-yh/2, xm+xw/2, ym+yh/2, lwd=2) +text(xm, ym, ".Rmd") + +xm <- 50 +ym <- 80 +rect(xm-xw/2, ym-yh/2, xm+xw/2, ym+yh/2, lwd=2) +text(xm, ym, ".md") +xm <- 50; ym <- 25 +for(i in c(2, 0, -2)) + rect(xm-xw/2+i, ym-yh/2+i, xm+xw/2+i, ym+yh/2+i, lwd=2, + border="black", col="white") +text(xm-2, ym-2, "figs/") + +xm <- 100-12 +ym <- 50 +rect(xm-xw/2, ym-yh/2, xm+xw/2, ym+yh/2, lwd=2) +text(xm, ym, ".html") + +arrows(22, 50, 38, 50, lwd=2, col="slateblue", len=0.1) +text((22+38)/2, 60, "knitr", col="darkslateblue", cex=1.3) + +arrows(62, 50, 78, 50, lwd=2, col="slateblue", len=0.1) +text((62+78)/2, 60, "pandoc", col="darkslateblue", cex=1.3) +``` + +## Chunk options + +There are a variety of options to affect how the code chunks are +treated. Here are some examples: + +- Use `echo=FALSE` to avoid having the code itself shown. +- Use `results="hide"` to avoid having any results printed. +- Use `eval=FALSE` to have the code shown but not evaluated. +- Use `warning=FALSE` and `message=FALSE` to hide any warnings or + messages produced. 
+- Use `fig.height` and `fig.width` to control the size of the figures + produced (in inches). + +So you might write: + +
+```{r load_libraries, echo=FALSE, message=FALSE}
+library("dplyr")
+library("ggplot2")
+```
+
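As a rough sketch of what these options override, knitr's built-in defaults can be queried from the R console. This assumes the `knitr` package is installed and that the code runs outside a knitting session, where the values may already have been changed.

```r
# Query a few chunk-option defaults (assumes knitr is installed).
defaults <- knitr::opts_chunk$get(c("echo", "eval", "warning", "message"))

# Each of these is TRUE unless it has been changed: code is shown, code is
# run, and warnings and messages are included in the output.
str(defaults)
```

Setting any of these to `FALSE` in a chunk header changes the behaviour for that chunk only.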
+ +Often there will be particular options that you'll want to use +repeatedly; for this, you can set _global_ chunk options, like so: + +
+```{r global_options, echo=FALSE}
+knitr::opts_chunk$set(fig.path="Figs/", message=FALSE, warning=FALSE,
+                      echo=FALSE, results="hide", fig.width=11)
+```
+
+ +The `fig.path` option defines where the figures will be saved. The `/` +here is really important; without it, the figures would be saved in +the standard place but just with names that begin with `Figs`. + +If you have multiple R Markdown files in a common directory, you might +want to use `fig.path` to define separate prefixes for the figure file +names, like `fig.path="Figs/cleaning-"` and `fig.path="Figs/analysis-"`. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 3 + +Use chunk options to control the size of a figure and to hide the +code. + +::::::::::::::: solution + +## Solution to Challenge 3 + +
+```{r echo = FALSE, fig.width = 3}
+plot(faithful)
+```
+
+ +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +You can review all of the `R` chunk options by navigating to +the "R Markdown Cheat Sheet" under the "Cheatsheets" section +of the "Help" field in the toolbar at the top of RStudio. + +## Inline R code + +You can make _every_ number in your report reproducible. Use \`r and \` for an in-line code chunk, +like so: ` ``r "r round(some_value, 2)"`` `. The code will be +executed and replaced with the _value_ of the result. + +Don't let these in-line chunks get split across lines. + +Perhaps precede the paragraph with a larger code chunk that does +calculations and defines variables, with `include=FALSE` for that larger +chunk (which is the same as `echo=FALSE` and `results="hide"`). + +Rounding can produce differences in output in such situations. You may want +`2.0`, but `round(2.03, 1)` will give just `2`. + +The +[`myround`](https://github.com/kbroman/broman/blob/master/R/myround.R) +function in the [R/broman](https://github.com/kbroman/broman) package handles +this. + +::::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 4 + +Try out a bit of in-line R code. + +::::::::::::::: solution + +## Solution to Challenge 4 + +Here's some inline code to determine that 2 + 2 = `r 2+2`. + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Other output options + +You can also convert R Markdown to a PDF or a Word document. Click the +little triangle next to the "Knit" button to get a drop-down +menu. Or you could put `pdf_document` or `word_document` in the initial header +of the file. + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: Creating PDF documents + +Creating .pdf documents may require installation of some extra software. The R +package `tinytex` provides some tools to help make this process easier for R users. 
+With `tinytex` installed, run `tinytex::install_tinytex()` to install the required +software (you'll only need to do this once) and then when you knit to pdf `tinytex` +will automatically detect and install any additional LaTeX packages that are needed to +produce the pdf document. Visit the [tinytex website](https://yihui.org/tinytex/) +for more information. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: Visual markdown editing in RStudio + +RStudio versions 1.4 and later include visual markdown editing mode. +In visual editing mode, markdown expressions (like `**bold words**`) are +transformed to the formatted appearance (**bold words**) as you type. +This mode also includes a toolbar at the top with basic formatting buttons, +similar to what you might see in common word processing software programs. +You can turn visual editing on and off by pressing +the ![](fig/visual_mode_icon.png){alt='Icon for turning on and off the visual editing mode in RStudio, which looks like a pair of compasses'} +button in the top right corner of your R Markdown document. 
+ +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Resources + +- [Knitr in a knutshell tutorial](https://kbroman.org/knitr_knutshell) +- [Dynamic Documents with R and knitr](https://www.amazon.com/exec/obidos/ASIN/1482203537/7210-20) (book) +- [R Markdown documentation](https://rmarkdown.rstudio.com) +- [R Markdown cheat sheet](https://www.rstudio.com/wp-content/uploads/2016/03/rmarkdown-cheatsheet-2.0.pdf) +- [Getting started with R Markdown](https://www.rstudio.com/resources/webinars/getting-started-with-r-markdown/) +- [R Markdown: The Definitive Guide](https://bookdown.org/yihui/rmarkdown/) (book by Rstudio team) +- [Reproducible Reporting](https://www.rstudio.com/resources/webinars/reproducible-reporting/) +- [The Ecosystem of R Markdown](https://www.rstudio.com/resources/webinars/the-ecosystem-of-r-markdown/) +- [Introducing Bookdown](https://www.rstudio.com/resources/webinars/introducing-bookdown/) + +:::::::::::::::::::::::::::::::::::::::: keypoints + +- Mix reporting written in R Markdown with software written in R. +- Specify chunk options to control formatting. +- Use `knitr` to convert these documents into PDF and other formats. + +:::::::::::::::::::::::::::::::::::::::::::::::::: diff --git a/locale/uk/episodes/15-wrap-up.Rmd b/locale/uk/episodes/15-wrap-up.Rmd new file mode 100644 index 000000000..d9fa5b74f --- /dev/null +++ b/locale/uk/episodes/15-wrap-up.Rmd @@ -0,0 +1,110 @@ +--- +title: Writing Good Software +teaching: 15 +exercises: 0 +source: Rmd +--- + +::::::::::::::::::::::::::::::::::::::: objectives + +- Describe best practices for writing R and explain the justification for each. + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::: questions + +- How can I write software that other people can use? 
+ +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Structure your project folder + +Keep your project folder structured, organized and tidy, by creating subfolders for your code files, manuals, data, binaries, output plots, etc. It can be done completely manually, or with the help of RStudio's `New Project` functionality, or a designated package, such as `ProjectTemplate`. + +::::::::::::::::::::::::::::::::::::::::: callout + +## Tip: ProjectTemplate - a possible solution + +One way to automate the management of projects is to install the third-party package, `ProjectTemplate`. +This package will set up an ideal directory structure for project management. +This is very useful as it enables you to have your analysis pipeline/workflow organised and structured. +Together with the default RStudio project functionality and Git you will be able to keep track of your +work as well as be able to share your work with collaborators. + +1. Install `ProjectTemplate`. +2. Load the library +3. Initialise the project: + +```{r, eval=FALSE} +install.packages("ProjectTemplate") +library("ProjectTemplate") +create.project("../my_project_2", merge.strategy = "allow.non.conflict") +``` + +For more information on ProjectTemplate and its functionality visit the +home page [ProjectTemplate](https://projecttemplate.net/index.html) + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Make code readable + +The most important part of writing code is making it readable and understandable. +You want someone else to be able to pick up your code and be able to understand +what it does: more often than not this someone will be you 6 months down the line, +who will otherwise be cursing past-self. + +## Documentation: tell us what and why, not how + +When you first start out, your comments will often describe what a command does, +since you're still learning yourself and it can help to clarify concepts and +remind you later. 
However, these comments aren't particularly useful later on +when you don't remember what problem your code is trying to solve. Try to also +include comments that tell you _why_ you're solving a problem, and _what_ problem +that is. The _how_ can come after that: it's an implementation detail you ideally +shouldn't have to worry about. + +## Keep your code modular + +Our recommendation is that you should separate your functions from your analysis +scripts, and store them in a separate file that you `source` when you open the R +session in your project. This approach is nice because it leaves you with an +uncluttered analysis script, and a repository of useful functions that can be +loaded into any analysis script in your project. It also lets you group related +functions together easily. + +## Break down problem into bite size pieces + +When you first start out, problem solving and function writing can be daunting +tasks, and hard to separate from code inexperience. Try to break down your +problem into digestible chunks and worry about the implementation details later: +keep breaking down the problem into smaller and smaller functions until you +reach a point where you can code a solution, and build back up from there. + +## Know that your code is doing the right thing + +Make sure to test your functions! + +## Don't repeat yourself + +Functions enable easy reuse within a project. If you see blocks of similar +lines of code throughout your project, those are usually candidates for being +moved into functions. + +If your calculations are performed through a series of functions, then the +project becomes more modular and easier to change. This is especially true +for functions where a particular input always gives a particular output. + +## Remember to be stylish + +Apply consistent style to your code. + +:::::::::::::::::::::::::::::::::::::::: keypoints + +- Keep your project folder structured, organized and tidy. +- Document what and why, not how.
+- Break programs into short single-purpose functions. +- Write re-runnable tests. +- Don't repeat yourself. +- Be consistent in naming, indentation, and other aspects of style. + +:::::::::::::::::::::::::::::::::::::::::::::::::: diff --git a/locale/uk/index.md b/locale/uk/index.md new file mode 100644 index 000000000..c434e5efa --- /dev/null +++ b/locale/uk/index.md @@ -0,0 +1,34 @@ +--- +site: sandpaper::sandpaper_site +--- + +_an introduction to R for non-programmers using gapminder data_ + +The goal of this lesson is to teach novice programmers to write modular code +and best practices for using R for data analysis. R is commonly used in many +scientific disciplines for statistical analysis and its array of third-party +packages. We find that many scientists who come to Software Carpentry workshops +use R and want to learn more. The emphasis of these materials is to give +attendees a strong foundation in the fundamentals of R, and to teach best +practices for scientific computing: breaking down analyses into modular units, +task automation, and encapsulation. + +Note that this workshop will focus on teaching the fundamentals of the +programming language R, and will not teach statistical analysis. + +The lesson contains more material than can be taught in a day. The [instructor notes page](instructors/instructor-notes.md) has some suggested lesson plans suitable for a one or half day workshop. + +A variety of third party packages are used throughout this workshop. These +are not necessarily the best, nor are they comprehensive, but they are +packages we find useful, and have been chosen primarily for their +usability. + +:::::::::::::::::::::::::::::::::::::::::: prereq + +## Prerequisites + +Understand that computers store data and instructions (programs, scripts etc.) in files. +Files are organised in directories (folders). +Know how to access files not in the working directory by specifying the path. 
+ +:::::::::::::::::::::::::::::::::::::::::::::::::: diff --git a/locale/uk/instructors/instructor-notes.md b/locale/uk/instructors/instructor-notes.md new file mode 100644 index 000000000..43ffc4c20 --- /dev/null +++ b/locale/uk/instructors/instructor-notes.md @@ -0,0 +1,132 @@ +--- +title: Instructor Notes +--- + +## Timing + +Leave about 30 minutes at the start of each workshop and another 15 mins +at the start of each session for technical difficulties like WiFi and +installing things (even if you asked students to install in advance, longer if +not). + +## Lesson Plans + +The lesson contains much more material than can be taught in a day. +Instructors will need to pick an appropriate subset of episodes to use +in a standard one day course. + +Some suggested paths through the material are: + +(suggested by [@liz-is](https://github.com/swcarpentry/r-novice-gapminder/issues/104#issuecomment-276529213)) + +- 01 Introduction to R and RStudio +- 04 Data Structures +- 05 Exploring Data Frames ("Realistic example" section onwards) +- 08 Creating Publication-Quality Graphics with ggplot2 +- 10 Functions Explained +- 13 Dataframe Manipulation with dplyr +- 15 Producing Reports With knitr + +(suggested by [@naupaka](https://github.com/swcarpentry/r-novice-gapminder/issues/104#issuecomment-312547509)) + +- 01 Introduction to R and RStudio +- 02 Project Management With RStudio +- 03 Seeking Help +- 04 Data Structures +- 05 Exploring Data Frames +- 06 Subsetting Data +- 09 Vectorization +- 08 Creating Publication-Quality Graphics with ggplot2 _OR_ + 13 Dataframe Manipulation with dplyr +- 15 Producing Reports With knitr + +A half day course could consist of (suggested by [@karawoo](https://github.com/swcarpentry/r-novice-gapminder/issues/104#issuecomment-277599864)): + +- 01 Introduction to R and RStudio +- 04 Data Structures (only creating vectors with `c()`) +- 05 Exploring Data Frames ("Realistic example" section onwards) +- 06 Subsetting Data (excluding factor, matrix 
and list subsetting) +- 08 Creating Publication-Quality Graphics with ggplot2 + +## Setting up git in RStudio + +There can be difficulties linking git to RStudio depending on the +operating system and the version of the operating system. To make sure +Git is properly installed and configured, the learners should go to +the Options window in the RStudio application. + +- **Mac OS X:** + - Go RStudio -> Preferences... -> Git/SVN + - Check and see whether there is a path to a file in the "Git executable" window. If not, the next challenge is figuring out where Git is located. + - In the terminal enter `which git` and you will get a path to the git executable. In the "Git executable" window you may have difficulties finding the directory since OS X hides many of the operating system files. While the file selection window is open, pressing "Command-Shift-G" will pop up a text entry box where you will be able to type or paste in the full path to your git executable: e.g. /usr/bin/git or whatever else it might be. +- **Windows:** + - Go Tools -> Global options... -> Git/SVN + - If you use the Software Carpentry Installer, then 'git.exe' should be installed at `C:/Program Files/Git/bin/git.exe`. + +To prevent the learners from having to re-enter their password each time they push a commit to GitHub, this command (which can be run from a bash prompt) will make it so they only have to enter their password once: + +```bash +$ git config --global credential.helper 'cache --timeout=10000000' +``` + +## RStudio Color Preview + +RStudio has a feature to preview the color for certain named colors and hexadecimal colors. This may confuse or distract learners (and instructors) who are not expecting it. 
+ +Mainly, this is likely to come up during the episode on "Data Structures" with the following code block: + +```r +cats <- data.frame(coat = c("calico", "black", "tabby"), + weight = c(2.1, 5.0, 3.2), + likes_string = c(1, 0, 1)) +``` + +This option can be turned off and on in the following menu setting: +Tools -> Global Options -> Code -> Display -> Enable preview of named and hexadecimal colors (under "Syntax") + +## Pulling in Data + +The easiest way to get the data used in this lesson during a workshop is to have +attendees download the raw data from [gapminder-data] and +[gapminder-data-wide]. + +Attendees can use the `File - Save As` dialog in their browser to save the file. + +## Overall + +Make sure to emphasize good practices: put code in scripts, and make +sure they're version controlled. Encourage students to create script +files for challenges. + +If you're working in a cloud environment, get them to upload the +gapminder data after the second lesson. + +Make sure to emphasize that matrices are vectors underneath the hood +and data frames are lists underneath the hood: this will explain a +lot of the esoteric behaviour encountered in basic operations. + +Vector recycling and function stacks are probably best explained +with diagrams on a whiteboard. + +Be sure to actually go through examples of an R help page: help files +can be intimidating at first, but knowing how to read them is tremendously +useful. + +Be sure to show the CRAN task views, look at one of the topics. + +There's a lot of content: move quickly through the earlier lessons. Their +extensiveness is mostly for purposes of learning by osmosis: so that their +memory will trigger later when they encounter a problem or some esoteric behaviour. + +Key lessons to take time on: + +- Data subsetting - conceptually difficult for novices +- Functions - learners especially struggle with this +- Data structures - worth being thorough, but you can go through it quickly. 
+ +Don't worry about being correct or knowing the material back-to-front. Use +mistakes as teaching moments: the most vital skill you can impart is how to +debug and recover from unexpected errors. + +[gapminder-data]: data/gapminder_data.csv +[gapminder-data-wide]: data/gapminder_wide.csv diff --git a/locale/uk/learners/discuss.md b/locale/uk/learners/discuss.md new file mode 100644 index 000000000..0605730b1 --- /dev/null +++ b/locale/uk/learners/discuss.md @@ -0,0 +1,7 @@ +--- +title: Discussion +--- + +Please see [our other R lesson][r-gap] for a different presentation of these concepts. + +[r-gap]: https://swcarpentry.github.io/r-novice-gapminder/ diff --git a/locale/uk/learners/reference.md b/locale/uk/learners/reference.md new file mode 100644 index 000000000..a4c31f8db --- /dev/null +++ b/locale/uk/learners/reference.md @@ -0,0 +1,342 @@ +--- +title: Reference +--- + +## Reference + +## [Introduction to R and RStudio](episodes/01-rstudio-intro.Rmd) + +- Use the escape key to cancel incomplete commands or running code + (or Ctrl+C if you're using R from the shell). +- Basic arithmetic operations follow standard order of precedence: + - Brackets: `(`, `)` + - Exponents: `^` or `**` + - Divide: `/` + - Multiply: `*` + - Add: `+` + - Subtract: `-` +- Scientific notation is available, e.g. `2e-3` +- Anything to the right of a `#` is a comment, R will ignore this! +- Functions are denoted by `function_name()`. Expressions inside the + brackets are evaluated before being passed to the function, and + functions can be nested. +- Mathematical functions: `exp`, `sin`, `log`, `log10`, `log2` etc. +- Comparison operators: `<`, `<=`, `>`, `>=`, `==`, `!=` +- Use `all.equal` to compare numbers! +- `<-` is the assignment operator. Anything to the right is evaluated, then + stored in a variable named to the left. +- `ls` lists all variables and functions you've created +- `rm` can be used to remove them +- When assigning values to function arguments, you _must_ use `=`.
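A few of the points above can be tried directly at the console. This is only an illustrative sketch; the variable name `weight_kg` is made up for the example.

```r
# Exponents bind tighter than multiplication, which binds tighter than addition
2 + 3 * 4^2                          # 50

# Floating-point arithmetic is inexact, so `==` can be surprising
0.1 + 0.2 == 0.3                     # FALSE
isTRUE(all.equal(0.1 + 0.2, 0.3))    # TRUE

# `<-` assigns a value; `=` names an argument inside a function call
weight_kg <- 57.5
round(weight_kg * 2.2, digits = 1)   # 126.5

# List what has been created so far, then remove the variable again
ls()
rm(weight_kg)
```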
+ +## [Project management with RStudio](episodes/02-project-intro.Rmd) + +- To create a new project, go to File -> New Project +- Install the `packrat` package to create self-contained projects +- `install.packages` to install packages from CRAN +- `library` to load a package into R +- `packrat::status` to check whether all packages referenced in your + scripts have been installed. + +## [Seeking help](episodes/03-seeking-help.Rmd) + +- To access help for a function type `?function_name` or `help(function_name)` +- Use quotes for special operators e.g. `?"+"` +- Use fuzzy search if you can't remember a name: `??search_term` +- [CRAN task views](https://cran.at.r-project.org/web/views) are a good starting point. +- [Stack Overflow](https://stackoverflow.com/) is a good place to get help with your code. + - `?dput` will dump data you are working from so others can load it easily. + - `sessionInfo()` will give details of your setup that others may need for debugging. + +## [Data structures](episodes/04-data-structures-part1.Rmd) + +Individual values in R must be one of 5 **data types**; multiple values can be grouped in **data structures**. + +**Data types** + +- `typeof(object)` gives information about an item's data type. + +- There are 5 main data types: + + - `?numeric` real (decimal) numbers + - `?integer` whole numbers only + - `?character` text + - `?complex` complex numbers + - `?logical` TRUE or FALSE values + + **Special types:** + + - `?NA` missing values + - `?NaN` "not a number" for undefined values (e.g. `0/0`). + - `?Inf`, `-Inf` infinity. + - `?NULL` a data structure that doesn't exist + + `NA` can occur in any atomic vector. `NaN` and `Inf` can only + occur in numeric (double) or complex vectors; integer vectors cannot hold them. Atomic vectors + are the building blocks for all other data structures. A `NULL` value + will occur in place of an entire data structure (but can occur as list + elements).
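The types and special values listed above can be checked with `typeof()` and a few helper functions. This is a small sketch; the name `holder` is only illustrative.

```r
# The five main data types
typeof(1.5)       # "double" (what this reference calls numeric)
typeof(2L)        # "integer"
typeof("text")    # "character"
typeof(1+2i)      # "complex"
typeof(TRUE)      # "logical"

# Special values
is.na(c(1, NA, 3))   # FALSE  TRUE FALSE
0 / 0                # NaN
1 / 0                # Inf
typeof(NaN)          # "double"
is.null(NULL)        # TRUE

# NULL can still be stored as an element of a list
holder <- list(a = 1, b = NULL)
length(holder)       # 2
```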
+ +**Basic data structures in R:** + +- atomic `?vector` (can only contain one type) +- `?list` (containers for other objects) +- `?data.frame` two dimensional objects whose columns can contain different types of data +- `?matrix` two dimensional objects that can contain only one type of data. +- `?factor` vectors that contain predefined categorical data. +- `?array` multi-dimensional objects that can only contain one type of data + +Remember that matrices are really atomic vectors underneath the hood, and that +data.frames are really lists underneath the hood (this explains some of the weirder +behaviour of R). + +**[Vectors](episodes/04-data-structures-part1.Rmd)** + +- `?vector()` All items in a vector must be the same type. +- Items can be converted from one type to another using _coercion_. +- The concatenate function 'c()' will append items to a vector. +- `seq(from=0, to=1, by=1)` will create a sequence of numbers. +- Items in a vector can be named using the `names()` function. + +**[Factors](episodes/04-data-structures-part1.Rmd)** + +- `?factor()` Factors are a data structure designed to store categorical data. +- `levels()` shows the valid values that can be stored in a vector of type factor. + +**[Lists](episodes/04-data-structures-part1.Rmd)** + +- `?list()` Lists are a data structure designed to store data of different types. + +**[Matrices](episodes/04-data-structures-part1.Rmd)** + +- `?matrix()` Matrices are a data structure designed to store 2-dimensional data. + +**[Data Frames](episodes/05-data-structures-part2.Rmd)** + +- `?data.frame` is a key data structure. It is a `list` of `vectors`. +- `cbind()` will add a column (vector) to a data.frame. +- `rbind()` will add a row (list) to a data.frame. + +**Useful functions for querying data structures:** + +- `?str` structure, prints out a summary of the whole data structure +- `?typeof` tells you the type inside an atomic vector +- `?class` what is the data structure? 
+- `?head` print the first `n` elements (rows for two-dimensional objects) +- `?tail` print the last `n` elements (rows for two-dimensional objects) +- `?rownames`, `?colnames`, `?dimnames` retrieve or modify the row names + and column names of an object. +- `?names` retrieve or modify the names of an atomic vector or list (or + columns of a data.frame). +- `?length` get the number of elements in an atomic vector +- `?nrow`, `?ncol`, `?dim` get the dimensions of an n-dimensional object + (won't work on atomic vectors or lists). + +## [Exploring Data Frames](episodes/05-data-structures-part2.Rmd) + +- `read.csv` to read in data in a regular structure + - `sep` argument to specify the separator + - "," for comma separated + - "\\t" for tab separated + - Other arguments: + - `header=TRUE` if there is a header row + +## [Subsetting data](episodes/06-data-subsetting.Rmd) + +- Elements can be accessed by: + + - Index + - Name + - Logical vectors + +- `[` single square brackets: + + - _extract_ single elements or _subset_ vectors + - e.g. `x[1]` extracts the first item from vector x. + - _extract_ single elements of a list. The returned value will be another `list()`. + - _extract_ columns from a data.frame + +- `[` with two arguments to: + + - _extract_ rows and/or columns of + - matrices + - data.frames + - e.g. `x[1,2]` will extract the value in row 1, column 2. + - e.g. `x[2, ]` will extract the entire second row of values, and `x[, 2]` the entire second column. + +- `[[` double square brackets to extract items from lists. + +- `$` to access columns or list elements by name + +- negative indices skip elements + +## [Control flow](episodes/07-control-flow.Rmd) + +- Use `if` condition to start a conditional statement, `else if` condition to provide + additional tests, and `else` to provide a default +- The bodies of the branches of conditional statements must be enclosed in curly braces `{ }` when they contain more than one statement; indentation is for readability only. +- Use `==` to test for equality. +- `%in%` will return a `TRUE`/`FALSE` indicating if there is a match between an element and a vector.
+- `X && Y` is only true if both X and Y are `TRUE`. +- `X || Y` is true if either X or Y, or both, are `TRUE`. +- Zero is considered `FALSE`; all other numbers are considered `TRUE` +- Nest loops to operate on multi-dimensional data. + +## [Creating publication quality graphics](episodes/08-plot-ggplot2.Rmd) + +- figures can be created with the grammar of graphics: + - `library(ggplot2)` + - `ggplot` to create the base figure + - `aes`thetics specify the data axes, shape, color, and data size + - `geom`etry functions specify the type of plot, e.g. `point`, `line`, `density`, `box` + - `geom`etry functions also add statistical transforms, e.g. `geom_smooth` + - `scale` functions change the mapping from data to aesthetics + - `facet` functions stratify the figure into panels + - `aes`thetics apply to individual layers, or can be set for the whole plot + inside `ggplot`. + - `theme` functions change the overall look of the plot + - order of layers matters! + - `ggsave` to save a figure. + +## [Vectorization](episodes/09-vectorization.Rmd) + +- Most functions and operations apply to each element of a vector +- `*` applies element-wise to matrices +- `%*%` for true matrix multiplication +- `any()` will return `TRUE` if any element of a vector is `TRUE` +- `all()` will return `TRUE` if _all_ elements of a vector are `TRUE` + +## [Functions explained](episodes/10-functions.Rmd) + +- `?"function"` +- Put code whose parameters change frequently in a function, then call it with + different parameter values to customize its behavior. +- The last line of a function is returned, or you can use `return` explicitly +- Any code written in the body of the function will preferably look for variables defined inside the function. 
+- Document Why, then What, then lastly How (if the code isn't self-explanatory) + +## [Writing data](episodes/11-writing-data.Rmd) + +- `write.table` to write out objects in regular format +- set `quote=FALSE` so that text isn't wrapped in `"` marks + +## [Dataframe manipulation with dplyr](episodes/12-dplyr.Rmd) + +- `library(dplyr)` +- `?select` to extract variables by name. +- `?filter` return rows with matching conditions. +- `?group_by` group data by one or more variables. +- `?summarize` summarize multiple values to a single value. +- `?mutate` add new variables to a data.frame. +- Combine operations using the `?"%>%"` pipe operator. + +## [Dataframe manipulation with tidyr](episodes/13-tidyr.Rmd) + +- `library(tidyr)` +- `?pivot_longer` convert data from _wide_ to _long_ format. +- `?pivot_wider` convert data from _long_ to _wide_ format. +- `?separate` split a single value into multiple values. +- `?unite` merge multiple values into a single value. + +## [Producing reports with knitr](episodes/14-knitr-markdown.Rmd) + +- Value of reproducible reports +- Basics of Markdown +- R code chunks +- Chunk options +- Inline R code +- Other output formats + +## [Best practices for writing good code](episodes/15-wrap-up.Rmd) + +- Program defensively, i.e., assume that errors are going to arise, and write code to detect them when they do. +- Write tests before writing code in order to help determine exactly what that code is supposed to do. +- Know what code is supposed to do before trying to debug it. +- Make it fail every time. +- Make it fail fast. +- Change one thing at a time, and for a reason. +- Keep track of what you've done. +- Be humble + +## Glossary + +[argument]{#argument} +: A value given to a function or program when it runs. +The term is often used interchangeably (and inconsistently) with [parameter](#parameter). + +[assign]{#assign} +: To give a value a name by associating a variable with it.
+ +[body]{#body} +: (of a function): the statements that are executed when a function runs. + +[comment]{#comment} +: A remark in a program that is intended to help human readers understand what is going on, +but is ignored by the computer. +Comments in Python, R, and the Unix shell start with a `#` character and run to the end of the line; +comments in SQL start with `--`, +and other languages have other conventions. + +[comma-separated values]{#comma-separated-values} +: (CSV) A common textual representation for tables +in which the values in each row are separated by commas. + +[delimiter]{#delimiter} +: A character or characters used to separate individual values, +such as the commas between columns in a [CSV](#comma-separated-values) file. + +[documentation]{#documentation} +: Human-language text written to explain what software does, +how it works, or how to use it. + +[floating-point number]{#floating-point-number} +: A number containing a fractional part and an exponent. +See also: [integer](#integer). + +[for loop]{#for-loop} +: A loop that is executed once for each value in some kind of set, list, or range. +See also: [while loop](#while-loop). + +[index]{#index} +: A subscript that specifies the location of a single value in a collection, +such as a single pixel in an image. + +[integer]{#integer} +: A whole number, such as -12343. See also: [floating-point number](#floating-point-number). + +[library]{#library} +: In R, the directory(ies) where [packages](#package) are stored. + +[package]{#package} +: A collection of R functions, data and compiled code in a well-defined format. Packages are stored in a [library](#library) and loaded using the library() function. + +[parameter]{#parameter} +: A variable named in the function's declaration that is used to hold a value passed into the call. +The term is often used interchangeably (and inconsistently) with [argument](#argument). 
+ +[return statement]{#return-statement} +: A statement that causes a function to stop executing and return a value to its caller immediately. + +[sequence]{#sequence} +: A collection of information that is presented in a specific order. + +[shape]{#shape} +: An array's dimensions, represented as a vector. +For example, a 5×3 array's shape is `(5,3)`. + +[string]{#string} +: Short for "character string", +a [sequence](#sequence) of zero or more characters. + +[syntax error]{#syntax-error} +: A programming error that occurs when statements are in an order or contain characters +not expected by the programming language. + +[type]{#type} +: The classification of something in a program (for example, the contents of a variable) +as a kind of number (e.g. [floating-point number](#floating-point-number), [integer](#integer)), [string](#string), +or something else. In R the command `typeof()` is used to query a variable's type. + +[while loop]{#while-loop} +: A loop that keeps executing as long as some condition is true. +See also: [for loop](#for-loop). diff --git a/locale/uk/learners/setup.md b/locale/uk/learners/setup.md new file mode 100644 index 000000000..736e10764 --- /dev/null +++ b/locale/uk/learners/setup.md @@ -0,0 +1,8 @@ +--- +title: Setup +--- + +This lesson assumes you have R and RStudio installed on your computer. + +- [Download and install the latest version of R](https://www.r-project.org/). +- [Download and install RStudio](https://www.rstudio.com/products/rstudio/download/#download). RStudio is an application (an integrated development environment or IDE) that facilitates the use of R and offers a number of nice additional features. You will need the free Desktop version for your computer. diff --git a/locale/uk/profiles/learner-profiles.md b/locale/uk/profiles/learner-profiles.md new file mode 100644 index 000000000..75b2c5cad --- /dev/null +++ b/locale/uk/profiles/learner-profiles.md @@ -0,0 +1,5 @@ +--- +title: FIXME +--- + +This is a placeholder file.
Please add content here.