diff --git a/docs/07-durable-code.md b/docs/07-durable-code.md
index 59d6f6dd..99d2064f 100644
--- a/docs/07-durable-code.md
+++ b/docs/07-durable-code.md
@@ -139,7 +139,7 @@ Keeping around old code and objects is generally more of a hindrance than a time
1) You might write better code on the second try (or third or n'th).
2) Keeping around old code makes it harder for you to write and troubleshoot new better code -- it's easier to confuse yourself. Sometimes a fresh start can be what you need.
-3) With version control you can always return to that old code! (We'll dive more into version control later on, but you've started the process by [uploading your code to GitHub in chapter 4](https://jhudatascience.org/Reproducibility_in_Cancer_Informatics/why-git-and-github.html)!)
+3) With version control you can always return to that old code! (We'll dive more into version control later on, but you've started the process by [uploading your code to GitHub in chapter 4](https://jhudatascience.org/Reproducibility_in_Cancer_Informatics/making-your-project-open-source-with-github.html)!)
This means you should not comment out old code. Just delete it! No code is so precious that you need to keep it commented out (particularly if you are using version control and you can retrieve it in other ways should you need it).
@@ -306,7 +306,7 @@ There's so many opinions and strategies on best practices for code. And although
#### R specific:
-- [Data Carpentry's: Best Practices for Writing R Code](https://swcarpentry.github.io/r-novice-inflammation/06-best-practices-R/) by @DataCarpentry2021b.
+- [Data Carpentry's: Best Practices for Writing R Code](https://swcarpentry.github.io/r-novice-inflammation/06-best-practices-R.html) by @DataCarpentry2021b.
- [R Programming for Research: Reproducible Research](https://geanders.github.io/RProgrammingForResearch/reproducible-research-1.html) by @Good2021.
- [R for Epidemiology: Coding best practices](https://www.r4epi.com/coding-best-practices.html) by @Cannell2021.
- [Best practices for R Programming](https://towardsdatascience.com/best-practices-for-r-programming-ec0754010b5a) by @Bernardo2021.
diff --git a/docs/404.html b/docs/404.html
index 372099e0..eff287e7 100644
--- a/docs/404.html
+++ b/docs/404.html
@@ -312,13 +312,18 @@
Page not found
You may want to try searching to find the page's new location, or use
the table of contents to find the page you are looking for.
+
+
+
+
+
+
+
+
diff --git a/docs/About.md b/docs/About.md
index 55e0d20d..93fbbdd1 100644
--- a/docs/About.md
+++ b/docs/About.md
@@ -46,7 +46,7 @@ These credits are based on our [course contributors table guidelines](https://ww
## collate en_US.UTF-8
## ctype en_US.UTF-8
## tz Etc/UTC
-## date 2022-10-13
+## date 2024-03-25
##
## ─ Packages ───────────────────────────────────────────────────────────────────
## package * version date lib source
diff --git a/docs/Introduction-to-Reproducibility.docx b/docs/Introduction-to-Reproducibility.docx
index 35621e21..f09f94e0 100644
Binary files a/docs/Introduction-to-Reproducibility.docx and b/docs/Introduction-to-Reproducibility.docx differ
diff --git a/docs/about-the-authors.html b/docs/about-the-authors.html
index 4869f9d9..b7cfdace 100644
--- a/docs/about-the-authors.html
+++ b/docs/about-the-authors.html
@@ -428,7 +428,7 @@ About the Authors
## collate en_US.UTF-8
## ctype en_US.UTF-8
## tz Etc/UTC
-## date 2022-10-13
+## date 2024-03-25
##
## ─ Packages ───────────────────────────────────────────────────────────────────
## package * version date lib source
@@ -479,13 +479,18 @@ About the Authors
+
+
+
+
+
+
+
+
diff --git a/docs/code-review.html b/docs/code-review.html
index e1511ba2..301eb045 100644
--- a/docs/code-review.html
+++ b/docs/code-review.html
@@ -407,13 +407,18 @@ References
Team, Smartbear. 2021. “Best Practices for Code Review.” Smartbear.com. https://smartbear.com/en/learn/code-review/best-practices-for-peer-code-review/.
+
+
+
+
+
+
+
+
diff --git a/docs/defining-reproducibility.html b/docs/defining-reproducibility.html
index ab1584b6..ae0c9968 100644
--- a/docs/defining-reproducibility.html
+++ b/docs/defining-reproducibility.html
@@ -387,13 +387,18 @@ References
Broman, Karl. 2016. “Tools for Reproducible Research.” https://kbroman.org/Tools4RR/.
+
+
+
+
+
+
+
+
diff --git a/docs/documenting-analyses.html b/docs/documenting-analyses.html
index dfb18bff..c005efc4 100644
--- a/docs/documenting-analyses.html
+++ b/docs/documenting-analyses.html
@@ -367,14 +367,14 @@ Get the exercise project file
Get the Python project example files
Click this link to download.
-Now double click your chapter zip file to unzip. For Windows you may have to follow these instructions).
+Now double click your chapter zip file to unzip. For Windows you may have to follow these instructions.
Get the R project example files
Click this link to download.
-Now double click your chapter zip file to unzip. For Windows you may have to follow these instructions).
+Now double click your chapter zip file to unzip. For Windows you may have to follow these instructions.
@@ -418,13 +418,18 @@
Exercise 2: Write a README fo
+
+
+
+
+
+
+
+
diff --git a/docs/index.html b/docs/index.html
index dd103879..816df1f1 100644
--- a/docs/index.html
+++ b/docs/index.html
@@ -308,7 +308,7 @@
About this Course
@@ -325,13 +325,18 @@ Available course formats
+
+
+
+
+
+
+
+
diff --git a/docs/index.md b/docs/index.md
index 434ae074..b1d811bb 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -1,6 +1,6 @@
---
title: "Intro to Reproducibility in Cancer Informatics"
-date: "October, 2022"
+date: "March, 2024"
site: bookdown::bookdown_site
documentclass: book
bibliography: [book.bib, packages.bib]
diff --git a/docs/introduction.html b/docs/introduction.html
index cba333a8..6985e6fa 100644
--- a/docs/introduction.html
+++ b/docs/introduction.html
@@ -358,13 +358,18 @@ References
Beaulieu-Jones, Brett K, and Casey S Greene. 2017. “Reproducibility of Computational Workflows Is Automated Using Continuous Analysis.” Nature Biotechnology 35 (4): 342–46. https://doi.org/10.1038/nbt.3780.
+
+
+
+
+
+
+
+
diff --git a/docs/making-your-project-open-source-with-github.html b/docs/making-your-project-open-source-with-github.html
index c31c55e1..69d3daae 100644
--- a/docs/making-your-project-open-source-with-github.html
+++ b/docs/making-your-project-open-source-with-github.html
@@ -352,14 +352,14 @@ Get the exercise project file
Get the Python project example files
Click this link to download.
-Now double click your chapter zip file to unzip. For Windows you may have to follow these instructions).
+Now double click your chapter zip file to unzip. For Windows you may have to follow these instructions.
Get the R project example files
Click this link to download.
-Now double click your chapter zip file to unzip. For Windows you may have to follow these instructions).
+Now double click your chapter zip file to unzip. For Windows you may have to follow these instructions.
@@ -383,13 +383,18 @@
Exercise: Set up a project on
+
+
+
+
+
+
+
+
diff --git a/docs/managing-package-versions.html b/docs/managing-package-versions.html
index 43143be0..d7b17614 100644
--- a/docs/managing-package-versions.html
+++ b/docs/managing-package-versions.html
@@ -342,14 +342,14 @@ Get the exercise project file
Get the Python project example files
Click this link to download.
-Now double click your chapter zip file to unzip. For Windows you may have to follow these instructions).
+Now double click your chapter zip file to unzip. For Windows you may have to follow these instructions.
Get the R project example files
Click this link to download.
-Now double click your chapter zip file to unzip. For Windows you may have to follow these instructions).
+Now double click your chapter zip file to unzip. For Windows you may have to follow these instructions.
@@ -485,13 +485,18 @@
References
Beaulieu-Jones, Brett K, and Casey S Greene. 2017.
“Reproducibility of Computational Workflows Is Automated Using Continuous Analysis.” Nature Biotechnology 35 (4): 342–46.
https://doi.org/10.1038/nbt.3780.
+
+
+
+
+
+
+
+
diff --git a/docs/no_toc/07-durable-code.md b/docs/no_toc/07-durable-code.md
index 774935a9..99ec386c 100644
--- a/docs/no_toc/07-durable-code.md
+++ b/docs/no_toc/07-durable-code.md
@@ -139,7 +139,7 @@ Keeping around old code and objects is generally more of a hindrance than a time
1) You might write better code on the second try (or third or n'th).
2) Keeping around old code makes it harder for you to write and troubleshoot new better code -- it's easier to confuse yourself. Sometimes a fresh start can be what you need.
-3) With version control you can always return to that old code! (We'll dive more into version control later on, but you've started the process by [uploading your code to GitHub in chapter 4](https://jhudatascience.org/Reproducibility_in_Cancer_Informatics/why-git-and-github.html)!)
+3) With version control you can always return to that old code! (We'll dive more into version control later on, but you've started the process by [uploading your code to GitHub in chapter 4](https://jhudatascience.org/Reproducibility_in_Cancer_Informatics/making-your-project-open-source-with-github.html)!)
This means you should not comment out old code. Just delete it! No code is so precious that you need to keep it commented out (particularly if you are using version control and you can retrieve it in other ways should you need it).
@@ -306,7 +306,7 @@ There's so many opinions and strategies on best practices for code. And although
#### R specific:
-- [Data Carpentry's: Best Practices for Writing R Code](https://swcarpentry.github.io/r-novice-inflammation/06-best-practices-R/) by @DataCarpentry2021b.
+- [Data Carpentry's: Best Practices for Writing R Code](https://swcarpentry.github.io/r-novice-inflammation/06-best-practices-R.html) by @DataCarpentry2021b.
- [R Programming for Research: Reproducible Research](https://geanders.github.io/RProgrammingForResearch/reproducible-research-1.html) by @Good2021.
- [R for Epidemiology: Coding best practices](https://www.r4epi.com/coding-best-practices.html) by @Cannell2021.
- [Best practices for R Programming](https://towardsdatascience.com/best-practices-for-r-programming-ec0754010b5a) by @Bernardo2021.
diff --git a/docs/no_toc/404.html b/docs/no_toc/404.html
index 372099e0..eff287e7 100644
--- a/docs/no_toc/404.html
+++ b/docs/no_toc/404.html
@@ -312,13 +312,18 @@ Page not found
You may want to try searching to find the page's new location, or use
the table of contents to find the page you are looking for.
+
+
+
+
+
+
+
+
diff --git a/docs/no_toc/About.md b/docs/no_toc/About.md
index 9d6605a8..21490ec4 100644
--- a/docs/no_toc/About.md
+++ b/docs/no_toc/About.md
@@ -46,7 +46,7 @@ These credits are based on our [course contributors table guidelines](https://ww
## collate en_US.UTF-8
## ctype en_US.UTF-8
## tz Etc/UTC
-## date 2022-10-13
+## date 2024-03-25
##
## ─ Packages ───────────────────────────────────────────────────────────────────
## package * version date lib source
diff --git a/docs/no_toc/about-the-authors.html b/docs/no_toc/about-the-authors.html
index 8b07ba57..b9bc1adb 100644
--- a/docs/no_toc/about-the-authors.html
+++ b/docs/no_toc/about-the-authors.html
@@ -428,7 +428,7 @@ About the Authors
## collate en_US.UTF-8
## ctype en_US.UTF-8
## tz Etc/UTC
-## date 2022-10-13
+## date 2024-03-25
##
## ─ Packages ───────────────────────────────────────────────────────────────────
## package * version date lib source
@@ -486,13 +486,18 @@ About the Authors
+
+
+
+
+
+
+
+
diff --git a/docs/no_toc/code-review.html b/docs/no_toc/code-review.html
index e1511ba2..301eb045 100644
--- a/docs/no_toc/code-review.html
+++ b/docs/no_toc/code-review.html
@@ -407,13 +407,18 @@ References
Team, Smartbear. 2021. “Best Practices for Code Review.” Smartbear.com. https://smartbear.com/en/learn/code-review/best-practices-for-peer-code-review/.
+
+
+
+
+
+
+
+
diff --git a/docs/no_toc/defining-reproducibility.html b/docs/no_toc/defining-reproducibility.html
index ab1584b6..ae0c9968 100644
--- a/docs/no_toc/defining-reproducibility.html
+++ b/docs/no_toc/defining-reproducibility.html
@@ -387,13 +387,18 @@ References
Broman, Karl. 2016. “Tools for Reproducible Research.” https://kbroman.org/Tools4RR/.
+
+
+
+
+
+
+
+
diff --git a/docs/no_toc/documenting-analyses.html b/docs/no_toc/documenting-analyses.html
index dfb18bff..c005efc4 100644
--- a/docs/no_toc/documenting-analyses.html
+++ b/docs/no_toc/documenting-analyses.html
@@ -367,14 +367,14 @@ Get the exercise project file
Get the Python project example files
Click this link to download.
-Now double click your chapter zip file to unzip. For Windows you may have to follow these instructions).
+Now double click your chapter zip file to unzip. For Windows you may have to follow these instructions.
Get the R project example files
Click this link to download.
-Now double click your chapter zip file to unzip. For Windows you may have to follow these instructions).
+Now double click your chapter zip file to unzip. For Windows you may have to follow these instructions.
@@ -418,13 +418,18 @@
Exercise 2: Write a README fo
+
+
+
+
+
+
+
+
diff --git a/docs/no_toc/index.html b/docs/no_toc/index.html
index dd103879..816df1f1 100644
--- a/docs/no_toc/index.html
+++ b/docs/no_toc/index.html
@@ -308,7 +308,7 @@
About this Course
@@ -325,13 +325,18 @@ Available course formats
+
+
+
+
+
+
+
+
diff --git a/docs/no_toc/index.md b/docs/no_toc/index.md
index 434ae074..b1d811bb 100644
--- a/docs/no_toc/index.md
+++ b/docs/no_toc/index.md
@@ -1,6 +1,6 @@
---
title: "Intro to Reproducibility in Cancer Informatics"
-date: "October, 2022"
+date: "March, 2024"
site: bookdown::bookdown_site
documentclass: book
bibliography: [book.bib, packages.bib]
diff --git a/docs/no_toc/introduction.html b/docs/no_toc/introduction.html
index cba333a8..6985e6fa 100644
--- a/docs/no_toc/introduction.html
+++ b/docs/no_toc/introduction.html
@@ -358,13 +358,18 @@ References
Beaulieu-Jones, Brett K, and Casey S Greene. 2017. “Reproducibility of Computational Workflows Is Automated Using Continuous Analysis.” Nature Biotechnology 35 (4): 342–46. https://doi.org/10.1038/nbt.3780.
+
+
+
+
+
+
+
+
diff --git a/docs/no_toc/making-your-project-open-source-with-github.html b/docs/no_toc/making-your-project-open-source-with-github.html
index c31c55e1..69d3daae 100644
--- a/docs/no_toc/making-your-project-open-source-with-github.html
+++ b/docs/no_toc/making-your-project-open-source-with-github.html
@@ -352,14 +352,14 @@ Get the exercise project file
Get the Python project example files
Click this link to download.
-Now double click your chapter zip file to unzip. For Windows you may have to follow these instructions).
+Now double click your chapter zip file to unzip. For Windows you may have to follow these instructions.
Get the R project example files
Click this link to download.
-Now double click your chapter zip file to unzip. For Windows you may have to follow these instructions).
+Now double click your chapter zip file to unzip. For Windows you may have to follow these instructions.
@@ -383,13 +383,18 @@
Exercise: Set up a project on
+
+
+
+
+
+
+
+
diff --git a/docs/no_toc/managing-package-versions.html b/docs/no_toc/managing-package-versions.html
index 43143be0..d7b17614 100644
--- a/docs/no_toc/managing-package-versions.html
+++ b/docs/no_toc/managing-package-versions.html
@@ -342,14 +342,14 @@ Get the exercise project file
Get the Python project example files
Click this link to download.
-Now double click your chapter zip file to unzip. For Windows you may have to follow these instructions).
+Now double click your chapter zip file to unzip. For Windows you may have to follow these instructions.
Get the R project example files
Click this link to download.
-Now double click your chapter zip file to unzip. For Windows you may have to follow these instructions).
+Now double click your chapter zip file to unzip. For Windows you may have to follow these instructions.
@@ -485,13 +485,18 @@
References
Beaulieu-Jones, Brett K, and Casey S Greene. 2017.
“Reproducibility of Computational Workflows Is Automated Using Continuous Analysis.” Nature Biotechnology 35 (4): 342–46.
https://doi.org/10.1038/nbt.3780.
+
+
+
+
+
+
+
+
diff --git a/docs/no_toc/organizing-your-project.html b/docs/no_toc/organizing-your-project.html
index ba6b188d..071a271e 100644
--- a/docs/no_toc/organizing-your-project.html
+++ b/docs/no_toc/organizing-your-project.html
@@ -410,14 +410,14 @@ Get the exercise project file
Get the Python project example files
Click this link to download.
-Now double click your chapter zip file to unzip. For Windows you may have to follow these instructions).
+Now double click your chapter zip file to unzip. For Windows you may have to follow these instructions.
Get the R project example files
Click this link to download.
-Now double click your chapter zip file to unzip. For Windows you may have to follow these instructions).
+Now double click your chapter zip file to unzip. For Windows you may have to follow these instructions.
+
+
+
+
+
+
+
+
diff --git a/docs/no_toc/references.html b/docs/no_toc/references.html
index f5e92b73..b9150c58 100644
--- a/docs/no_toc/references.html
+++ b/docs/no_toc/references.html
@@ -498,13 +498,18 @@ References
+
+
+
+
+
+
+
+
diff --git a/docs/no_toc/resources/images/01-intro_files/figure-html/1LMurysUhCjZb7DVF6KS9QmJ5NBjwWVjRn40MS9f2noE_g106226cdd08_0_0.png b/docs/no_toc/resources/images/01-intro_files/figure-html/1LMurysUhCjZb7DVF6KS9QmJ5NBjwWVjRn40MS9f2noE_g106226cdd08_0_0.png
index eb5c3a98..781c1ada 100644
Binary files a/docs/no_toc/resources/images/01-intro_files/figure-html/1LMurysUhCjZb7DVF6KS9QmJ5NBjwWVjRn40MS9f2noE_g106226cdd08_0_0.png and b/docs/no_toc/resources/images/01-intro_files/figure-html/1LMurysUhCjZb7DVF6KS9QmJ5NBjwWVjRn40MS9f2noE_g106226cdd08_0_0.png differ
diff --git a/docs/no_toc/resources/images/06-package-management_files/figure-html/1LMurysUhCjZb7DVF6KS9QmJ5NBjwWVjRn40MS9f2noE_g102dc56db08_0_0.png b/docs/no_toc/resources/images/06-package-management_files/figure-html/1LMurysUhCjZb7DVF6KS9QmJ5NBjwWVjRn40MS9f2noE_g102dc56db08_0_0.png
index 3493e335..b96bc1fc 100644
Binary files a/docs/no_toc/resources/images/06-package-management_files/figure-html/1LMurysUhCjZb7DVF6KS9QmJ5NBjwWVjRn40MS9f2noE_g102dc56db08_0_0.png and b/docs/no_toc/resources/images/06-package-management_files/figure-html/1LMurysUhCjZb7DVF6KS9QmJ5NBjwWVjRn40MS9f2noE_g102dc56db08_0_0.png differ
diff --git a/docs/no_toc/resources/images/06-package-management_files/figure-html/1LMurysUhCjZb7DVF6KS9QmJ5NBjwWVjRn40MS9f2noE_gf62875ddf7_0_404.png b/docs/no_toc/resources/images/06-package-management_files/figure-html/1LMurysUhCjZb7DVF6KS9QmJ5NBjwWVjRn40MS9f2noE_gf62875ddf7_0_404.png
index e631c7d5..8e97ff1c 100644
Binary files a/docs/no_toc/resources/images/06-package-management_files/figure-html/1LMurysUhCjZb7DVF6KS9QmJ5NBjwWVjRn40MS9f2noE_gf62875ddf7_0_404.png and b/docs/no_toc/resources/images/06-package-management_files/figure-html/1LMurysUhCjZb7DVF6KS9QmJ5NBjwWVjRn40MS9f2noE_gf62875ddf7_0_404.png differ
diff --git a/docs/no_toc/search_index.json b/docs/no_toc/search_index.json
index cddb50b5..d85ec678 100644
--- a/docs/no_toc/search_index.json
+++ b/docs/no_toc/search_index.json
@@ -1 +1 @@
-[["index.html", "Intro to Reproducibility in Cancer Informatics About this Course 0.1 Available course formats", " Intro to Reproducibility in Cancer Informatics October, 2022 About this Course This course is part of a series of courses for the Informatics Technology for Cancer Research (ITCR) called the Informatics Technology for Cancer Research Education Resource. This material was created by the ITCR Training Network (ITN) which is a collaborative effort of researchers around the United States to support cancer informatics and data science training through resources, technology, and events. This initiative is funded by the following grant: National Cancer Institute (NCI) UE5 CA254170. Our courses feature tools developed by ITCR Investigators and make it easier for principal investigators, scientists, and analysts to integrate cancer informatics into their workflows. Please see our website at www.itcrtraining.org for more information. 0.1 Available course formats This course is available in multiple formats which allows you to take it in the way that best suites your needs. You can take it for certificate which can be for free or fee. The material for this course can be viewed without login requirement on this Bookdown website. This format might be most appropriate for you if you rely on screen-reader technology. This course can be taken for free certification through Leanpub. This course can be taken on Coursera for certification here (but it is not available for free on Coursera). Our courses are open source, you can find the source material for this course on GitHub. "],["introduction.html", "Chapter 1 Introduction 1.1 Target Audience 1.2 Topics covered: 1.3 Motivation 1.4 Curriculum 1.5 How to use the course", " Chapter 1 Introduction 1.1 Target Audience The course is intended for students in the biomedical sciences and researchers who use informatics tools in their research and have not had training in reproducibility tools and methods. 
This course is written for individuals who: Have some familiarity with R or Python - have written some scripts. Have not had formal training in computational methods. Have limited or no familiarity with GitHub, Docker, or package management tools. 1.2 Topics covered: This is a two part series: 1.3 Motivation Cancer datasets are plentiful, complicated, and hold untold amounts of information regarding cancer biology. Cancer researchers are working to apply their expertise to the analysis of these vast amounts of data, but training opportunities to properly equip them in these efforts can be sparse. This includes training in reproducible data analysis methods. Data analyses are generally not reproducible without direct contact with the original researchers and a substantial amount of time and effort (Beaulieu-Jones and Greene 2017). Reproducibility in cancer informatics (as with other fields) is still not monitored or incentivized despite being fundamental to the scientific method. Despite the lack of incentive, many researchers strive for reproducibility in their own work but often lack the skills or training to do so effectively. Equipping researchers with the skills to create reproducible data analyses increases the efficiency of everyone involved. Reproducible analyses are more likely to be understood, applied, and replicated by others. This helps expedite the scientific process by helping researchers avoid false positive dead ends. Open source clarity in reproducible methods also saves researchers’ time so they don’t have to reinvent the proverbial wheel for methods that everyone in the field is already performing. 1.4 Curriculum This course introduces the concepts of reproducibility and replicability in the context of cancer informatics. It uses hands-on exercises to demonstrate in practical terms how to increase the reproducibility of data analyses. 
The course also introduces tools relevant to reproducibility including analysis notebooks, package managers, git, and GitHub. The course includes hands-on exercises for applying reproducible code concepts to your own code. Individuals who take this course are encouraged to complete these activities as they follow along with the course material to help increase the reproducibility of their analyses. Goal of this course: Equip learners with reproducibility skills they can apply to their existing analysis scripts and projects. This course opts for an “ease into it” approach. We attempt to give learners doable, incremental steps to increase the reproducibility of their analyses. What is not the goal: This course is meant to introduce learners to the reproducibility tools, but it does not necessarily represent the absolute end-all, be-all best practices for the use of these tools. In other words, this course gives a starting point with these tools, but not an ending point. The advanced version of this course is the next step toward incrementally “better practices”. 1.5 How to use the course This course is designed with busy professional learners in mind – who may have to pick up and put down the course when their schedule allows. Each exercise has the option for you to continue along with the example files as you’ve been editing them in each chapter, OR you can download fresh chapter files that have been edited in accordance with the relevant part of the course. This way, if you decide to skip a chapter or find that your own files you’ve been working on no longer make sense, you have a fresh starting point at each exercise. References "],["defining-reproducibility.html", "Chapter 2 Defining reproducibility 2.1 Learning Objectives 2.2 What is reproducibility 2.3 Reproducibility in daily life 2.4 Reproducibility is worth the effort! 
2.5 Reproducibility exists on a continuum!", " Chapter 2 Defining reproducibility 2.1 Learning Objectives 2.2 What is reproducibility There’s been a lot of discussion about what is included in the term reproducibility and there is some discrepancy between fields. For the purposes of informatics and data analysis, a reproducible analysis is one that can be re-run by a different researcher and the same result and conclusion is found. Reproducibility is related to repeatability and replicability, but it is worth taking time to differentiate these terms. Perhaps you are like Ruby and have just found an interesting pattern through your data analysis! This has probably been the result of many months or years on your project and it’s worth celebrating! But before she considers these results a done deal, Ruby should test whether she is able to re-run her own analysis and get the same results again. This is known as repeatability. Given that Ruby’s analysis is repeatable, she may feel confident now to share her preliminary results with her colleague, Avi the Associate. Whether or not someone else will be able to take Ruby’s code and data, re-run the analysis and obtain the same results is known as reproducibility. If Ruby’s results are able to be reproduced by Avi, now Avi may collect new data and use Ruby’s same analysis methods to analyze his data. Whether or not Avi’s new data and results concur with Ruby’s study’s original inferences is known as replicability. You may realize that these levels of research build on each other (like science is supposed to do). In this way, we can think of these in a hierarchy. Skipping any of these levels of research applicability can lead to unreliable results and conclusions. Science progresses when data and hypotheses are put through these levels thoroughly and sequentially. If results are not repeatable, they won’t be reproducible or replicable. 
Ideally all analyses and results would be reproducible without too much time and effort spent; this would aid in the efficiency of research getting to the next stages and questions. But unfortunately, in practice, reproducibility is not as commonplace as we would hope. Institutions and reward systems generally do not prioritize or even measure reproducibility standards in research, and training opportunities for reproducible techniques can be scarce. Reproducible research can often feel like an uphill battle that is made steeper by lack of training opportunities. In this course, we hope to equip you with the tools you need to enhance the reproducibility of your analyses so this uphill battle is less steep. 2.3 Reproducibility in daily life What does reproducibility mean in the daily life of a researcher? Let’s say Ruby’s results are repeatable in her own hands and she excitedly tells her associate, Avi, about her preliminary findings. Avi is very excited about these results as well as Ruby’s methods! So Ruby sends Avi the code and data she used to obtain the results. Now, whether or not Avi is able to obtain the same exact results with this same data and same analysis code will indicate if Ruby’s analysis is reproducible. Ruby may have spent a lot of time on her code and getting it to work on her computer, but whether it will successfully run on Avi’s computer is another story. Often when researchers share their analysis code it leads to a substantial amount of effort on the part of the researcher who has received the code to get it working, and this often cannot be done successfully without help from the original code author (Beaulieu-Jones and Greene 2017). Avi is encountering errors because Ruby’s code was written with Ruby’s computer and local setup in mind and she didn’t know how to make it more generally applicable. 
Avi is spending a lot of time just trying to re-run Ruby’s same analysis on her same data; he has yet to be able to try the code on any additional data (which will likely bring up even more errors). Avi is still struggling to work with Ruby’s code and is confused about the goals and approaches the code is taking. After struggling with Ruby’s code for an untold amount of time, Avi may decide it’s time to email Ruby to get some clarity. Now both Avi and Ruby are confused about why this analysis isn’t nicely re-running for Avi. Their attempts to communicate about the code through email haven’t helped them clarify anything. Multiple versions of the code may have been sent back and forth between them, and now things are taking a lot more time than either of them expected. Perhaps at some point Avi is able to successfully run Ruby’s code on Ruby’s same data. Just because Avi didn’t get any errors doesn’t mean that the code ran exactly the same as it did for Ruby. Lack of errors also doesn’t mean that either Ruby or Avi’s runs of the code ran with high accuracy or that the results can be trusted. Even a small difference in a decimal point may indicate a more fundamental difference in how the analysis was performed, and this could be due to differences in software versions, settings, or any number of items in their computing environments. 2.4 Reproducibility is worth the effort! Perhaps you’ve found yourself in a situation like Ruby and Avi: struggling to re-run code that you thought for sure was working a minute ago. In the upcoming chapters, we will discuss how to bolster your projects’ reproducibility. As you apply these reproducible techniques to your own projects, you may feel like it is taking more time to reach endpoints, but keep in mind that reproducible analyses and projects have higher upfront costs that will absolutely pay off in the long term. 
Reproducibility in your analyses is not only a time saver for yourself, but also for your colleagues, your field, and your future self! You might not change a single character in your code but then return to it in a few days/months/years and find that it no longer runs! Reproducible code stands the test of time longer, making ‘future you’ glad you spent the time to work on it. It’s said that your closest collaborator is you from 6 months ago, but you don’t reply to email (Broman 2016). Many a data scientist has referred to their frustration with their past selves: Dear past-Hadley: PLEASE COMMENT YOUR CODE BETTER. Love present-Hadley — Hadley Wickham (@hadleywickham) April 7, 2016 The more you comment your code and make it clear and readable, the more your future self will thank you. Reproducible code also saves your colleagues time! The more reproducible your code is, the less time all of your collaborators will need to spend troubleshooting it. The more people who use your code and need to try to fix it, the more time is wasted. This can add up to a lot of wasted researcher time and effort. But, reproducible code saves everyone exponential amounts of time and effort! It will also motivate individuals to use and cite your code and analyses in the future! 2.5 Reproducibility exists on a continuum! Incremental work on your analyses is good! You do not need to make your analyses perfect on the first try or even within a particular time frame. The first step in creating an analysis is to get it to work once! But the work does not end there. Furthermore, no analysis is or will ever be perfect in that it will not be reproducible in every single context throughout time. Incrementally pushing our analyses toward the right of this continuum is the goal. 
References

# Chapter 3 Organizing your project

## 3.1 Learning Objectives

Keeping your files organized is a skill that has a high long-term payoff. As you are in the thick of an analysis, you may underestimate how many files and terms you have floating around. But a short time later, you may return to your files and realize your organization was not as clear as you hoped.

Tayo (2019) discusses four particular reasons why it is important to organize your project:

- Organization increases productivity. If a project is well organized, with everything placed in one directory, it makes it easier to avoid wasting time searching for project files such as datasets, code, output files, and so on.
- A well-organized project helps you to keep and maintain a record of your ongoing and completed data science projects.
- Completed data science projects could be used for building future models. If you have to solve a similar problem in the future, you can use the same code with slight modifications.
- A well-organized project can easily be understood by other data science professionals when shared on platforms such as GitHub.

Organization is yet another aspect of reproducibility that saves you and your colleagues time!

## 3.2 Organizational strategies

There are a lot of ways to keep your files organized, and there’s no “one size fits all” organizational solution (Shapiro et al. 2021). In this chapter, we will discuss some generalities, but as far as specifics go, we will point you to others who have written about what works for them, and advise that you use their ideas as inspiration to figure out a strategy that works for you and your team.
The most important aspects of your project organization scheme are that it:

- Is project-oriented (Bryan 2017).
- Follows consistent patterns (Shapiro et al. 2021).
- Makes it easy for you and others to find the files you need quickly (Shapiro et al. 2021).
- Minimizes the likelihood for errors (like writing over files accidentally) (Shapiro et al. 2021).
- Is something maintainable (Shapiro et al. 2021)!

### 3.2.1 Tips for organizing your project:

Getting more specific, here are some ideas for how to organize your project:

- Make file names informative to those who don’t have knowledge of the project, but avoid using spaces, quotes, or unusual characters in your filenames and folders -- these only serve to make reading in files a nightmare in some programs.
- Number scripts in the order that they are run.
- Keep like files together in their own directory: results tables with other results tables, etc. Most importantly, keep raw data separate from processed data or other results!
- Put source scripts and functions in their own directory -- things that should never need to be called directly by yourself or anyone else.
- Put output in its own directories, like results and plots.
- Have a central document (like a README) that describes the basic information about the analysis and how to re-run it.
- Make it easy on yourself: dates aren’t necessary in file names. The computer keeps track of those.
- Make a central script that re-runs everything -- including the creation of the folders! (more on this in a later chapter)

Let’s see what these principles might look like put into practice.
#### 3.2.1.1 Example organizational scheme

Here’s an example of what this might look like:

```
project-name/
├── run_analysis.sh
├── 00-download-data.sh
├── 01-make-heatmap.Rmd
├── README.md
├── plots/
│   └── project-name-heatmap.png
├── results/
│   └── top_gene_results.tsv
├── raw-data/
│   ├── project-name-raw.tsv
│   └── project-name-metadata.tsv
├── processed-data/
│   └── project-name-quantile-normalized.tsv
└── util/
    ├── plotting-functions.R
    └── data-wrangling-functions.R
```

What these hypothetical files and folders contain:

- `run_analysis.sh` - A central script that runs everything again.
- `00-download-data.sh` - The script that needs to be run first and is called by `run_analysis.sh`.
- `01-make-heatmap.Rmd` - The script that needs to be run second and is also called by `run_analysis.sh`.
- `README.md` - The document that has the information that will orient someone to this project; we’ll discuss more about how to create a helpful README in an upcoming chapter.
- `plots` - A folder of plots and resulting images.
- `results` - A folder of results.
- `raw-data` - Data files as they first arrive, before anything has been done to them.
- `processed-data` - Data that has been modified from the raw data in some way.
- `util` - A folder of utilities that never need to be called or touched directly unless troubleshooting something.

## 3.3 Readings about organizational strategies for data science projects:

But you don’t have to take my organizational strategy; there are lots of ideas out there. You can read through some of these articles to think about what kind of organizational strategy might work for you and your team:

- Jenny Bryan’s organizational strategies (Bryan and Hester 2021).
- Danielle Navarro’s organizational strategies (Navarro 2021).
- Jenny Bryan on project-oriented workflows (Bryan 2017).
- Data Carpentry mini-course about organizing projects (“Project Organization and Management for Genomics” 2021).
- Andrew Severin’s strategy for organization (Severin 2021).
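A central script that re-runs everything can also create the folders, as suggested in the tips above. As a minimal, hypothetical sketch (the folder names mirror the example scheme; `create_project_skeleton` and the `"project-name"` default are illustrative, not part of the course materials):

```python
from pathlib import Path

# Folder names drawn from the example organizational scheme above.
FOLDERS = ["plots", "results", "raw-data", "processed-data", "util"]

def create_project_skeleton(root="project-name"):
    """Create the standard folder skeleton; safe to re-run thanks to exist_ok=True."""
    root = Path(root)
    for folder in FOLDERS:
        # parents=True also creates the project root if it doesn't exist yet.
        (root / folder).mkdir(parents=True, exist_ok=True)
    # Return the directories that now exist, for a quick sanity check.
    return sorted(p.name for p in root.iterdir() if p.is_dir())
```

Calling a function like this at the top of your central script means a collaborator (or future you) never has to hand-create the directory structure before re-running the analysis.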
- A BioStars thread where many individuals share their own organizational strategies (“How Do You Manage Your Files & Directories for Your Projects?” 2010).
- Data Carpentry course chapter about getting organized (“Introduction to the Command Line for Genomics” 2019).

## 3.4 Get the exercise project files (or continue with the files you used in the previous chapter)

Get the Python project example files: click this link to download. Now double click your chapter zip file to unzip. For Windows you may have to follow these instructions.

Get the R project example files: click this link to download. Now double click your chapter zip file to unzip. For Windows you may have to follow these instructions.

## 3.5 Exercise: Organize your project!

Using your computer’s GUI (drag, drop, and clicking), organize the files that are part of this project:

- Organize these files using an organizational scheme similar to what is described above.
- Create folders like `plots`, `results`, and `data`. Note that `aggregated_metadata.json` and `LICENSE.TXT` also belong in the `data` folder.
- You will want to delete any files that say “OLD”. Keeping multiple versions of your scripts around is a recipe for mistakes and confusion. In the advanced course we will discuss how to use version control to help you track this more elegantly.

After your files are organized, you are ready to move on to the next chapter and create a notebook!

Any feedback you have regarding this exercise is greatly appreciated; you can fill out this form!

References

# Chapter 4 Making your project open source with GitHub

## 4.1 Learning Objectives

Git is a version control system that is a great tool for creating reproducible analyses. What is version control?
Ruby here is experiencing a lack of version control and could probably benefit from using Git. All of us at one point or another have created different versions of a file or document, but for analysis projects this can easily get out of hand if you don’t have a system in place. That’s where Git comes in handy. There are other version control systems as well, but Git is the most popular, in part because it works with GitHub, an online hosting service for Git-controlled files.

### 4.1.1 GitHub and Git allow you to…

#### 4.1.1.1 Maintain transparent analyses

Open and transparent analyses are a critical part of conducting open science. GitHub allows you to conduct your analyses in an open source manner. Open science also allows others to better understand your methods and potentially borrow them for their own research, saving everyone time!

#### 4.1.1.2 Have backups of your code and analyses at every point

Life happens; sometimes you misplace a file or your computer malfunctions. If you ever lose data on your computer or need to retrieve something from an earlier version of your code, GitHub allows you to recover from your losses.

#### 4.1.1.3 Keep a documented history of your project

Over time in a project, a lot happens, especially when it comes to exploring and handling data. Sometimes the rationale behind decisions that were made around an analysis can get lost. GitHub keeps communications and tracks the changes to your files so that you don’t have to revisit a question you already answered.

#### 4.1.1.4 Collaborate with others

Analysis projects benefit greatly from good collaborations! But having multiple copies of code on multiple collaborators’ computers can be a nightmare to keep straight. GitHub allows people to work on the same set of code concurrently but still have a method to integrate all the edits together in a systematic way.
#### 4.1.1.5 Experiment with your analysis

Data science projects often lead to side analyses that could be very worthwhile but might be scary to venture into if you don’t have your code well version controlled. Git and GitHub allow you to venture into these side experiments without fear, since your main code can be kept safe from your side venture.

## 4.2 Get the exercise project files (or continue with the files you used in the previous chapter)

Get the Python project example files: click this link to download. Now double click your chapter zip file to unzip. For Windows you may have to follow these instructions.

Get the R project example files: click this link to download. Now double click your chapter zip file to unzip. For Windows you may have to follow these instructions.

## 4.3 Exercise: Set up a project on GitHub

Go here for the video tutorial version of this exercise.

Now that we understand how useful GitHub is for creating reproducible analyses, it’s time to set ourselves up on GitHub. Git and GitHub have a whole rich world of tools and terms that can get complex quickly, but for this exercise we will not worry about those terms and functionalities just yet; instead we will focus on getting code up on GitHub so we are ready to collaborate and conduct open analyses!

1. Go to GitHub’s main page and click Sign Up if you don’t have an account.
2. Follow these instructions to create a repository. As a general, but not absolute, rule, you will want to keep one GitHub repository for one analysis project. Name the repository something that reminds you what it’s related to.
3. Choose Public.
4. Check the box that says Add a README.
5. Follow these instructions to add the example files you downloaded to your new repository.

Congrats! You’ve started your very own project on GitHub! We encourage you to do the same with your own code and other projects!

Any feedback you have regarding this exercise is greatly appreciated; you can fill out this form!
# Chapter 5 Using Notebooks

## 5.1 Learning Objectives

Notebooks are a handy way to have the code, output, and scientist’s thought process all documented in one place that is easy for others to read and follow. The notebook environment is incredibly useful for reproducible data science for a variety of reasons:

##### 5.1.0.1 Reason 1: Notebooks allow for tracking data exploration and encourage the scientist to narrate their thought process:

> Each executed code cell is an attempt by the researcher to achieve something and to tease out some insight from the data set. The result is displayed immediately below the code commands, and the researcher can pause and think about the outcome. As code cells can be executed in any order, modified and re-executed as desired, deleted and copied, the notebook is a convenient environment to iteratively explore a complex problem. (Fangohr 2021)

##### 5.1.0.2 Reason 2: Notebooks allow for easy sharing of results:

> Notebooks can be converted to html and pdf, and then shared as static read-only documents. This is useful to communicate and share a study with colleagues or managers. By adding sufficient explanation, the main story can be understood by the reader, even if they wouldn’t be able to write the code that is embedded in the document. (Fangohr 2021)

##### 5.1.0.3 Reason 3: Notebooks can be re-run as a script or developed interactively:

> A common pattern in science is that a computational recipe is iteratively developed in a notebook. Once this has been found and should be applied to further data sets (or other points in some parameter space), the notebook can be executed like a script, for example by submitting these scripts as batch jobs.
(Fangohr 2021)

This can also be handy, especially if you use automation to enhance the reproducibility of your analyses (something we will talk about in the advanced part of this course).

Because of all of these reasons, we encourage the use of computational notebooks as a means of enhancing reproducibility. (This course itself is also written with the use of notebooks!)

## 5.2 Get the exercise project files (or continue with the files you used in the previous chapter)

Get the Python project example files: click this link to download. Now double click your chapter zip file to unzip. For Windows you may have to follow these instructions.

Get the R project example files: click this link to download. Now double click your chapter zip file to unzip. For Windows you may have to follow these instructions.

## 5.3 Exercise: Convert code into a notebook!

### 5.3.1 Set up your IDE

For this chapter, we will create notebooks from our example files’ code. Notebooks work best with the integrated development environment (IDE) they were created to work with. IDEs are sets of tools that help you develop your code. They are part “point and click” and part command line, and include lots of visuals that will help guide you.

Set up a Python IDE

Install JupyterLab

We advise using the conda method to install JupyterLab, because we will return to talk more about conda later on; if you don’t have conda, you will need to install that first. We advise going with Anaconda instead of Miniconda. To install Anaconda, you can download it from here. Download the installer, and follow the installation prompts.

Start up Anaconda Navigator. On the home page choose JupyterLab and click Install. This may take a few minutes. Now you should be able to click Launch underneath JupyterLab. This will open up a page in your browser with JupyterLab.
Getting familiar with JupyterLab’s interface

The JupyterLab interface consists of a main work area containing tabs of documents and activities, a collapsible left sidebar, and a menu bar. The left sidebar contains a file browser, the list of running kernels and terminals, the command palette, the notebook cell tools inspector, and the tabs list.

The menu bar at the top of JupyterLab has top-level menus that expose actions available in JupyterLab with their keyboard shortcuts. The default menus are:

- File: actions related to files and directories
- Edit: actions related to editing documents and other activities
- View: actions that alter the appearance of JupyterLab
- Run: actions for running code in different activities such as notebooks and code consoles
- Kernel: actions for managing kernels, which are separate processes for running code
- Tabs: a list of the open documents and activities in the dock panel
- Settings: common settings and an advanced settings editor
- Help: a list of JupyterLab and kernel help links

Set up an R IDE

Install RStudio

Install RStudio (and install R first if you have not already). After you’ve downloaded the RStudio installation file, double click on it and follow along with the installation prompts. Open up the RStudio application by double clicking on it.

Getting familiar with RStudio’s interface

The RStudio environment has four main panes, each of which may have a number of tabs that display different information or functionality. (Their specific locations can be changed under Tools -> Global Options -> Pane Layout.)

- The Editor pane is where you can write R scripts and other documents. Each tab here is its own document. This is your text editor, which will allow you to save your R code for future use. Note that code you change here will not run until you run it.
- The Console pane is where you can interactively run R code.
- There is also a Terminal tab here which can be used for running programs outside R on your computer.
- The Environment pane primarily displays the variables, sometimes known as objects, that are defined during a given R session, and what data or values they might hold.
- The Help viewer pane has several tabs, all of which are pretty important:
  - The Files tab shows the structure and contents of files and folders (also known as directories) on your computer.
  - The Plots tab will reveal plots when you make them.
  - The Packages tab shows which installed packages have been loaded into your R session.
  - The Help tab will show the help page when you look up a function.
  - The Viewer pane will reveal compiled R Markdown documents.

From Shapiro et al. (2021)

More reading about RStudio’s interface:

- RStudio IDE Cheatsheet (pdf).
- Navigating the RStudio Interface - R for Epidemiology

### 5.3.2 Create a notebook!

Now, in your respective IDE, we’ll turn our unreproducible scripts into notebooks. In the next chapter we will begin to dive into the code itself, but for now, we’ll get the notebook ready to go.

Set up a Python notebook

1. Start a new notebook by going to New > Notebook. Then open up this chapter’s example code folder and open the `make-heatmap.py` file.
2. Create a new code chunk in your notebook.
3. Now copy and paste all of the code from `make-heatmap.py` into the new chunk. We will break up this large chunk of code into smaller, thematic chunks in the next chapter.
4. Save your `Untitled.ipynb` file as something that tells us what it will end up doing, like `make-heatmap.ipynb`.

For more about using Jupyter notebooks, see this by Mike (2021).

Set up an R notebook

1. Start a new notebook by going to File > New Files > R Notebook. Then open up this chapter’s example code folder and open the `make_heatmap.R` file.
2. Practice creating a new chunk in your R notebook by clicking the Code > Insert Chunk button on the toolbar or by pressing Cmd + Option + I (on Mac) or Ctrl + Alt + I (on Windows). (You can also manually type out the backticks and `{}`.)
3. Delete all the default text in this notebook but keep the header, which is surrounded by `---` and looks like:

```
title: "R Notebook"
output: html_notebook
```

4. You can feel free to change the title from "R Notebook" to something that better suits the contents of this notebook.
5. Now copy and paste all of the code from `make_heatmap.R` into a new chunk. We will break up this large chunk of code into smaller, thematic chunks in the next chapter.
6. Save your `untitled.Rmd` as something that tells us what it will end up doing, like `make-heatmap.Rmd`.
7. Notice that upon saving your `.Rmd` file, a new `.nb.html` file of the same name is created. Open that file and choose View in Browser. If RStudio asks you to choose a browser, then choose a default browser.
8. This shows the nicely rendered version of your analysis and snapshots whatever output existed when the `.Rmd` file was saved.

For more about using R notebooks, see this by Xie, Allaire, and Grolemund (2018).

Now that you’ve created your notebook, you are ready to start polishing that code!

Any feedback you have regarding this exercise is greatly appreciated; you can fill out this form!

References

# Chapter 6 Managing package versions

## 6.1 Learning Objectives

As we discussed previously, sometimes two different researchers can run the same code on the same data and get different results!
What Ruby and Avi may not realize is that although they may have used the same code and data, the software packages that they have on each of their computers might be very different. Even if they have the same software packages, they likely don’t have the same versions, and versions can influence results! Different computing environments are not only a headache to untangle, they also can influence the reproducibility of your results (Beaulieu-Jones and Greene 2017).

There are multiple ways to deal with variations in computing environments so that your analyses will be reproducible, and we will discuss a few different strategies for tackling this problem in this course and its follow-up course. For now, we will start with the least intensive to implement: session info.

There are two strategies for dealing with software versions that we will discuss in this chapter. Either of these strategies can be used alone, or you can use both. They address different aspects of the computing environment discrepancy problem.

### 6.1.1 Strategy 1: Session info - record a list of your packages

One strategy to combat different software versions is to list the session info. Having your code list details about your computing environment is the easiest (though not the most comprehensive) method for handling differences in software versions. Session info can lead to clues as to why results weren’t reproducible.

For example, if both Avi and Ruby ran notebooks and included a session info printout, it may look like this: the session info shows us that they have different R versions and different operating systems. Both have the rmarkdown package attached, but with different rmarkdown package versions. If Avi and Ruby have discrepancies in their results, the session info printout gives a record which may hold clues to explain them. This can give them items to look into for determining why the results didn’t reproduce as expected.
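The Python exercises below use the third-party `session_info` package. If you just want the core idea with the standard library alone, a minimal sketch might look like this (the function name `print_session_info` is illustrative, not from the course materials):

```python
import platform
from importlib import metadata

def print_session_info(packages):
    """Print a minimal record of the computing environment:
    the Python version, the operating system, and package versions."""
    print(f"Python {platform.python_version()} on {platform.platform()}")
    for pkg in packages:
        try:
            print(f"{pkg}=={metadata.version(pkg)}")
        except metadata.PackageNotFoundError:
            # Record missing packages too; that itself is a useful clue.
            print(f"{pkg} (not installed)")
```

If Ruby and Avi each printed a record like this at the end of their notebooks, a quick diff of the two printouts would immediately surface version and operating system differences.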
### 6.1.2 Strategy 2: Package managers - share a usable snapshot of your environment

Package managers can help handle your computing environment for you in a way that you can share with others. In general, package managers work by capturing a snapshot of the environment, and when that environment snapshot is shared, they attempt to rebuild it. For the R and Python versions of the exercises, we will be using different managers, but the foundational strategy will be the same: include a file that someone else could replicate your package setup from.

For both exercises, we will download an environment ‘snapshot’ file we’ve set up for you, then we will practice adding a new package to the environments we’ve provided, and add the environment files to your new repository along with the rest of your example project files.

- For Python, we’ll use conda for package management and store this information in an `environment.yml` file.
- For R, we’ll use renv for package management and store this information in a `renv.lock` file.

## 6.2 Get the exercise project files (or continue with the files you used in the previous chapter)

Get the Python project example files: click this link to download. Now double click your chapter zip file to unzip. For Windows you may have to follow these instructions.

Get the R project example files: click this link to download. Now double click your chapter zip file to unzip. For Windows you may have to follow these instructions.

## 6.3 Exercise 1: Print out session info

Python version of the exercise

In your scientific notebook, you’ll need to add two items:

1. Add `import session_info` to a code chunk at the beginning of your notebook.
2. Add `session_info.show()` to a new code chunk at the very end of your notebook.
3. Save your notebook as is. Note it will not run correctly until we address the issues with the code in the next chapter.
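To make the ‘snapshot file’ idea concrete before the exercises: a snapshot is just a machine-readable list of package names and exact versions that a tool can later rebuild from. This sketch writes a pip-style list of everything installed; it is an illustration of the concept, not a replacement for conda or renv (the function name and file name are hypothetical):

```python
from importlib import metadata

def snapshot_environment(path="requirements-snapshot.txt"):
    """Write a pip-style 'name==version' line for every installed package."""
    lines = sorted(
        f"{dist.metadata['Name']}=={dist.version}"
        for dist in metadata.distributions()
        if dist.metadata["Name"]  # skip distributions with broken metadata
    )
    with open(path, "w") as handle:
        handle.write("\n".join(lines) + "\n")
    return lines
```

conda’s `environment.yml` and renv’s `renv.lock` play exactly this role, with the added benefit that `conda env create` and `renv::restore()` can rebuild the environment from the file automatically.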
R version of the exercise

In your Rmd file, add a chunk at the very end that looks like this:

```r
sessionInfo()
```

```
## R version 4.0.2 (2020-06-22)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.3 LTS
##
## Matrix products: default
## BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so
##
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=C
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base
##
## loaded via a namespace (and not attached):
##  [1] knitr_1.33      magrittr_2.0.2  hms_0.5.3       R6_2.4.1
##  [5] rlang_0.4.10    highr_0.8       stringr_1.4.0   httr_1.4.2
##  [9] tools_4.0.2     xfun_0.26       jquerylib_0.1.4 htmltools_0.5.0
## [13] ellipsis_0.3.1  ottrpal_0.1.2   yaml_2.2.1      digest_0.6.25
## [17] tibble_3.0.3    lifecycle_1.0.0 crayon_1.3.4    bookdown_0.24
## [21] readr_1.4.0     vctrs_0.3.4     fs_1.5.0        curl_4.3
## [25] evaluate_0.14   rmarkdown_2.10  stringi_1.5.3   compiler_4.0.2
## [29] pillar_1.4.6    pkgconfig_2.0.3
```

Save your notebook as is. Note it will not run correctly until we address the issues with the code in the next chapter.

## 6.4 Exercise 2: Package management

Python version of the exercise

1. Download this starter conda `environment.yml` file by clicking on the link and place it in your example project files directory.
2. Navigate to your example project files directory using the command line.
3. Create your conda environment by using this file in the command: `conda env create --file environment.yml`
4. Activate your conda environment using this command: `conda activate reproducible-python`
5. Now start up JupyterLab again using this command: `jupyter lab`
6. Follow these instructions to add the `environment.yml` file to the GitHub repository you created in the previous chapter.
Later we will practice and discuss how to more fully utilize the features of GitHub, but for now, just drag and drop it as the linked instructions describe.

### 6.4.1 More resources on how to use conda

- Install Jupyter using your own environment (Mac specific)
- Definitive guide to using conda

R version of the exercise

First install the renv package. Go to RStudio and the Console pane:

1. Install renv (you should only need to do this once per computer or RStudio environment): `install.packages("renv")`
2. Now set up renv to use in your project. Change to your project’s directory using `setwd()` in your console window (don’t put this in a script or notebook). Then use this command in your project: `renv::init()`. This will start up renv in your particular project. (What’s `::` about? In brief, it allows you to use a function from a package without loading the entire thing with `library()`.)
3. Now you can develop your project as you normally would, installing and removing packages in R as you see fit. For the purposes of this exercise, let’s install the styler package using the following command (the styler package will come in handy for styling our code in the next chapter): `install.packages("styler")`
4. Now that we have installed styler, we will want to add it to our renv snapshot. To add any packages we’ve installed to our renv snapshot, we use this command: `renv::snapshot()`

This will save whatever packages we are currently using to our environment snapshot file, called `renv.lock`. This `renv.lock` file is what we can share with our collaborators so they can replicate our computing environment.

If your package installation attempts are unsuccessful and you’d like to revert to the previous state of your environment, you can run `renv::restore()`. This will restore your environment to what it was before you attempted to install styler or whatever packages you tried to install.

You should see that an `renv.lock` file is now created or updated!
You will want to always include this file with your project files, which means we will want to add it to our GitHub! Follow these instructions to add your `renv.lock` file to the GitHub repository you created in the previous chapter. Later we will practice and discuss how to more fully utilize the features of GitHub, but for now, just drag and drop it as the linked instructions describe.

After you’ve added your computing environment files to your GitHub, you’re ready to continue using them with your IDE to actually work on the code in your notebook!

Any feedback you have regarding this exercise is greatly appreciated; you can fill out this form!

References

# Chapter 7 Writing durable code

## 7.1 Learning Objectives

## 7.2 General principles

##### 7.2.0.1 Work on your code iteratively

Getting your code to work the first time is the first step, but don’t stop there! Just as in writing a manuscript you wouldn’t consider your first draft a final draft, polishing your code works best in an iterative manner. Although you may need to set it aside for the day to give your brain a rest, return to your code later with fresh eyes and try to look for ways to improve upon it!

##### 7.2.0.2 Prioritize readability over cleverness

Some cleverness in code can be helpful; too much can make it difficult for others (including your future self!) to understand. If cleverness compromises the readability of your code, it probably is not worth it. Clever but unreadable code won’t be re-used or trusted by others (again, including your future self!).

What does readable code look like?
Orosz (2019) has some thoughts on writing readable code:

> Readable code starts with code that you find easy to read. When you finish coding, take a break to clear your mind. Then try to re-read the code, putting yourself in the mindset that you know nothing about the changes and why you made them.
>
> Can you follow along with your code? Do the variables and method names help understand what they do? Are there comments at places where just the code is not enough? Is the style of the code consistent across the changes?
>
> Think about how you could make the code more readable. Perhaps you see some functions that do too many things and are too long. Perhaps you find that renaming a variable would make its purpose clearer. Make changes until you feel like the code is as expressive, concise, and pretty as it can be.
>
> The real test of readable code is others reading it. So get feedback from others, via code reviews. Ask people to share feedback on how clear the code is. Encourage people to ask questions if something does not make sense. Code reviews - especially thorough code reviews - are the best way to get feedback on how good and readable your code is.
>
> Readable code will attract little to no clarifying questions, and reviewers won’t misunderstand it. So pay careful attention to the cases when you realize someone misunderstood the intent of what you wrote or asked a clarifying question. Every question or misunderstanding hints to opportunities to make the code more readable.
>
> A good way to get more feedback on the clarity of your code is to ask for feedback from someone who is not an expert on the codebase you are working on. Ask specifically for feedback on how easy to read your code is. Because this developer is not an expert on the codebase, they’ll focus on how much they can follow your code. Most of the comments they make will be about your code’s readability.

We’ll talk a bit more about code review in an upcoming chapter!

More reading:

- Readable Code by Orosz (2019).
- Write clean R code by Dubel (2021).
- Python Clean Code: 6 Best Practices to Make Your Python Functions More Readable by Tran (2021).

#### DRY up your code

DRY is an acronym: "Don't repeat yourself" (Smith 2013).

> "I hate code, and I want as little of it as possible in our product." – Diederich (2012)

If you find yourself writing something more than once, you might want to write a function, or store something as a variable. The added benefit of writing a function is that you might be able to borrow it in another project. DRY code is easier to fix and maintain because if it breaks, it's easier to fix something in one place than in 10 places. DRY code is easier on the reviewer because they don't have to review the same thing twice, but also because they don't have to review the same thing twice. ;)

DRYing code takes some iterative passes and edits, but in the end DRY code saves you and your collaborators time, and it can be something you reuse again in a future project!

Here's a slightly modified example from Bernardo (2021) of what DRY vs non-DRY code might look like:

```r
paste('Hello', 'John', 'welcome to this course')
paste('Hello', 'Susan', 'welcome to this course')
paste('Hello', 'Matt', 'welcome to this course')
paste('Hello', 'Anne', 'welcome to this course')
paste('Hello', 'Joe', 'welcome to this course')
paste('Hello', 'Tyson', 'welcome to this course')
paste('Hello', 'Julia', 'welcome to this course')
paste('Hello', 'Cathy', 'welcome to this course')
```

Could be functional-ized and rewritten as:

```r
GreetStudent <- function(name) {
  greeting <- paste('Hello', name, 'welcome to this course')
  return(greeting)
}

class_names <- c('John', 'Susan', 'Matt', 'Anne', 'Joe', 'Tyson', 'Julia', 'Cathy')
lapply(class_names, GreetStudent)
```

Now, if you wanted to edit the greeting, you'd only need to edit it in the function, instead of in each instance.

More reading about this idea:

- DRY Programming Practices by Klinefelter (2016).
- Keeping R Code DRY with functions by Riffomonas Project (2021).
- Write efficient R code for science by Max Joseph (2017).
- Write efficient Python code by Leah Wasser (2019).
- Don't repeat yourself: Python functions by Héroux (2018).

#### Don't be afraid to delete and refresh a lot

Don't be afraid to delete it all and re-run (multiple times). This includes refreshing your kernel/session in your IDE. In essence, this is the data science version of "Have you tried turning it off and then on again?" Some bugs in your code exist or go unnoticed because old objects and libraries have overstayed their welcome in your environment.

Why do you need to refresh your kernel/session? As a quick example, let's suppose you are troubleshooting something that centers around an object named `some_obj`, but then you rename this object to `iris_df`. When you rename this object you may need to update other places in the code. If you don't refresh your environment while working on your code, `some_obj` will still be in your environment, which will make it more difficult for you to find where else the code needs to be updated. Refreshing your kernel/session goes beyond objects defined in your environment; it also affects packages and dependencies loaded and all kinds of other things attached to your kernel/session.

As a quick experiment, try this in your Python or R environment. The `dir()` and `ls()` functions list your defined variables in your Python and R environments, respectively.

In Python:

```python
some_obj=[]
dir()
```

Now refresh your kernel and re-run `dir()`:

```python
dir()
```

You should see you no longer have `some_obj` listed as being defined in your environment.

In R:

```r
some_obj <- c()
ls()
```

Now refresh your session and re-run `ls()`:

```r
ls()
```

You should see you no longer have `some_obj` listed as being defined in your environment.

Keeping around old code and objects is generally more of a hindrance than a time saver.
Sometimes it can be easy to get very attached to a chunk of code that took you a long time to troubleshoot, but there are three reasons you don't need to stress about deleting it:

1) You might write better code on the second try (or third or n'th).
2) Keeping around old code makes it harder for you to write and troubleshoot new, better code -- it's easier to confuse yourself. Sometimes a fresh start can be what you need.
3) With version control you can always return to that old code! (We'll dive more into version control later on, but you've started the process by uploading your code to GitHub in chapter 4!)

This means you should not comment out old code. Just delete it! No code is so precious that you need to keep it commented out (particularly if you are using version control and you can retrieve it in other ways should you need it).

Related to this, if you want to be certain that your code is reproducible, it's worth deleting all your output and re-running everything with a fresh session. The first step to knowing if your analysis is reproducible is seeing if you can repeat it yourself!

#### Use code comments effectively

Good code comments are a part of writing good, readable code! Your code is more likely to stand the test of time if others, including yourself in the future, can see what's happening well enough to trust it themselves. This will encourage others to use your code and help you maintain it! 'Current You' who is writing your code may know what is happening, but 'Future You' will have no idea what 'Current You' was thinking (Spielman, n.d.):

> 'Future You' comes into existence about one second after you write code, and has no idea what on earth Past You was thinking. Help out 'Future You' by adding lots of comments! 'Future You' next week thinks Today You is an idiot, and the only way you can convince 'Future You' that Today You is reasonably competent is by adding comments in your code explaining why Today You is actually not so bad.
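To make the "explain the why" advice concrete, here is a small Python sketch. The variance values, cutoff, and variable names are all hypothetical illustrations, not taken from the course's exercise data:

```python
import statistics

# Hypothetical per-gene variance values
gene_variances = [0.2, 5.1, 3.3, 0.1, 7.8]

# Why: low-variance genes carry little signal for clustering and would
# crowd the heatmap, so we filter them out *before* plotting rather than
# after. (This comment explains the "why"; the code shows the "what".)
variance_cutoff = statistics.median(gene_variances)
high_variance_genes = [v for v in gene_variances if v > variance_cutoff]
```

A comment that merely restated "take the median and filter" would add nothing; the value is in recording the reasoning a reader cannot recover from the code alone.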
Your code and your understanding of it will fade soon after you write it, leaving your hard work to fall into disuse. Code that works is a start, but readable AND working code is best! Comments can help clarify points where your code might need further explanation. The best code comments explain the why of what you are doing. The act of writing them can also help you think through your thought process and perhaps identify a better solution for the odd parts of your code. (From Savonen (2021a))

More reading on creating clarifying code comments:

- Best Practices for Writing Code Comments by Spertus (2021).
- What Makes a Good Code Comment? by Cronin (2019).
- The Value of Code Documentation by Meza (2018).
- Some internet wisdom on R documentation by Frazee (2014).
- How to Comment Your Code Like a Pro: Best Practices and Good Habits by Keeton (2019).

#### Use informative variable names

Try to avoid using variable names that have no meaning, like `tmp`, `x`, or `i`. Meaningful variable names make your code more readable! Additionally, variable names that are longer than one letter are much easier to search and replace if needed. One-letter variables are hard to replace and hard to read. Don't be afraid of long variable names; they are very unlikely to be confused!

From Hobert (2018):

1) Write intention-revealing names.
2) Use consistent notation for naming conventions.
3) Use standard terms.
4) Do not number a variable name.
5) When you find a better way to name a variable, refactor as fast as possible.

More reading:

- R for Epidemiology - Coding best practices by Cannell (2021).
- Data Scientists: Your Variable Names Are Awful. Here's How to Fix Them by Koehrsen (2019).
- Writing Variable — Informative, Descriptive & Elegant by Hobert (2018).

#### Follow a code style

Just like when writing doesN"t FoLLOW conv3nTi0Ns OR_sPAcinng 0r sp3llinG, it can be distracting, the same goes for code.
Your code may even work all the same, just like you understood what I wrote in that last sentence, but a lack of consistent style can require more brain power from your readers to understand it. For reproducibility purposes, readability is important! The easier you can make it on your readers, the more likely they will be able to understand and reproduce the results.

There are different style guides out there that people adhere to. It doesn't matter so much which one you choose as that you pick one and stick to it for a particular project.

Python style guides:

- PEP8 style guide "PEP 8 – Style Guide for Python Code" (2021).
- Google Python style guide "Styleguide" (2021).

R style guides:

- Hadley Wickham's style guide Wickham (2019).
- Google R style guide "Google's R Style Guide" (2021).

Although following a style as you write is a good practice, we're all human and that can be tricky to do, so we recommend running an automatic styler on your code to fix it up for you. For Python code, you can use black, and for R, styler.

#### Organize the structure of your code

Readable code should follow an organized structure. Just like how outlines help the structure of manuscript writing, outlines can also help the organization of code writing. A tentative outline for a notebook might look like this:

1) A description of the purpose of the code (in Markdown).
2) Import the libraries you will need (including sourcing any custom functions).
3) List any hard-coded variables.
4) Import data.
5) Do any data cleaning needed.
6) The main thing you need to do.
7) Print out session info.

Note that if your notebook gets too long, you may want to separate things out into their own scripts. Additionally, it's good practice to keep custom functions in their own file and import them. This allows you to use them elsewhere and also keeps the main part of the analysis cleaner.
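As a rough sketch, an outline like the one above might translate into a script skeleton like this. Every name, value, and path below is a hypothetical placeholder, not the course's actual exercise code:

```python
# 1) Purpose: in a notebook, this would be a Markdown cell describing the analysis.

# 2) Import the libraries you will need (including any custom functions).
import os

# 3) List any hard-coded variables in one place, with comments explaining them.
SEED = 1234                # seed for any steps involving randomness
VARIANCE_QUANTILE = 0.90   # hypothetical cutoff for filtering genes

# 4) Import data (hypothetical project layout).
data_dir = os.path.join("data", "SRP070849")
data_file = os.path.join(data_dir, "SRP070849.tsv")

# 5) Data cleaning, 6) the main analysis, and 7) printing session info
#    would follow here, each in its own clearly labeled section.
```

Collecting the hard-coded variables near the top means anyone adapting the analysis to a new dataset knows exactly where to look.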
#### Set the seed if your analysis has randomness involved

If any randomness is involved in your analysis, you will want to set the seed in order for your results to be reproducible. In brief, computers don't actually create numbers randomly; they create numbers pseudorandomly. If you want your results to be reproducible, you should give your computer a seed with which to create random numbers. This will allow anyone who re-runs your analysis to have a positive control and eliminate randomness as a reason the results were not reproducible.

For more on how setting the seed works, we'll run a quick experiment here. First let's set a seed (it doesn't matter what number we use, just that we pick a number), so let's use 1234, and then create a "random" number.

```r
# Set the seed:
set.seed(1234)

# Now create a random number
runif(1)
## [1] 0.1137034
```

Now if we try a different seed, we will get a different "random" number.

```r
# Set a different seed:
set.seed(4321)

# Now create a random number again
runif(1)
## [1] 0.334778
```

But, if we return to the original seed we used, 1234, we will get the original "random" number we got.

```r
# Set this back to the original seed
set.seed(1234)

# Now we'll get the same "random" number we got when we set the seed to 1234 previously
runif(1)
## [1] 0.1137034
```

More reading:

- Set seed by Soage (2020).
- Generating random numbers by Chang (2021).

#### To review general principles:

## 7.3 More reading on best coding practices

There are so many opinions and strategies on best practices for code. And although a lot of these principles are generally applicable, not all of them are one size fits all. Some code practices are context-specific, so sometimes you may need to pick and choose what works for you, your team, and your particular project.

#### Python specific:

- Reproducible Programming for Biologists Who Code Part 2: Should Dos by Heil (2020).
- 15 common coding mistakes data scientists make in Python (and how to fix them) by Csendes (2020).
- Data Science in Production — Advanced Python Best Practices by Kostyuk (2020).
- 6 Mistakes Every Python Beginner Should Avoid While Coding by Saxena (2021).

#### R specific:

- Data Carpentry's: Best Practices for Writing R Code by "Best Practices for Writing R Code – Programming with R" (2021).
- R Programming for Research: Reproducible Research by Good (2021).
- R for Epidemiology: Coding best practices by Cannell (2021).
- Best practices for R Programming by Bernardo (2021).

## 7.4 Get the exercise project files (or continue with the files you used in the previous chapter)

Get the Python project example files: click this link to download. Now double click your chapter zip file to unzip. (For Windows you may have to follow these instructions.)

Get the R project example files: click this link to download. Now double click your chapter zip file to unzip. (For Windows you may have to follow these instructions.)

## 7.5 Exercise 1: Make code more durable!

### Organize the big picture of the code

Before diving in line-by-line, it can be helpful to make a code outline of sorts. What are the main steps you need to accomplish in this notebook? What are the starting and ending points for this particular notebook? For example, for this make-heatmap notebook we want to:

1) Set up analysis folders and declare file names.
2) Install the libraries we need.
3) Import the gene expression data and metadata.
4) Filter down the gene expression data to genes of interest – in this instance the most variant ones.
5) Clean the metadata.
6) Create an annotated heatmap.
7) Save the heatmap to a PNG.
8) Print out the session info!

### Python version of the exercise

The exercise: Polishing code

- Start up JupyterLab by running `jupyter lab` from your command line.
- Activate your conda environment using `conda activate reproducible-python`.
- Open up the notebook you made in the previous chapter, `make-heatmap.ipynb`.
- Work on organizing the code chunks and adding documentation to reflect the steps we've laid out in the previous section. You may want to work on this iteratively as we dive into the code.

As you clean up the code, you should run and re-run chunks to see if they work as you expect. You will also want to refresh your environment to help you develop the code (sometimes older objects stuck in your environment can inhibit your ability to troubleshoot). In Jupyter, you refresh your environment by using the refresh icon in the toolbar or by going to Restart Kernel.

**Set the seed**

Rationale: The clustering in the analysis involves some randomness. We need to set the seed!

Before: Nothing! We didn't set the seed before!

After: You can pick any number; it doesn't have to be 1234.

```python
random.seed(1234)
```

**Use a relative file path**

Rationale: Absolute file paths only work for the original writer of the code and no one else. But if we make the file path relative to the project setup, then it will work for whoever has the project repository (Mustafeez 2021). Additionally, we can set up our file path names using f-strings so that we only need to change the project ID and the rest will be ready for a new dataset (Python 2021)! Although this requires more lines of code, this setup is much more flexible and ready for others to use.
Before:

```python
df1=pd.read_csv('~/a/file/path/only/I/have/SRP070849.tsv', sep='\t')
mdf=pd.read_csv('~/a/file/path/only/I/have/SRP070849_metadata.tsv', sep='\t')
```

After:

```python
# Declare project ID
id = "SRP070849"

# Define the file path to the data directory
data_dir = Path(f"data/{id}")

# Declare the file path to the gene expression matrix file
data_file = data_dir.joinpath(f"{id}.tsv")

# Declare the file path to the metadata file
# inside the directory saved as `data_dir`
metadata_file = data_dir.joinpath(f"metadata_{id}.tsv")

# Read in metadata TSV file
metadata = pd.read_csv(metadata_file, sep="\t")

# Read in data TSV file
expression_df = pd.read_csv(data_file, sep="\t")
```

Related readings:

- f-strings in Python by Geeks (2018).
- f-Strings: A New and Improved Way to Format Strings in Python by Python (2021).
- Relative vs absolute file paths by Mustafeez (2021).
- About join path by "Python Examples of Pathlib.Path.joinpath" (2021).

**Avoid using mystery numbers**

Rationale: Avoid using numbers that don't have context around them in the code. Include the calculations for the number, or if it needs to be hard-coded, explain the rationale for that number in the comments. Additionally, using variable and column names that tell you what is happening helps clarify what the number represents.

Before:

```python
df1['calc'] =df1.var(axis = 1, skipna = True)
df2=df1[df1.calc >float(10)]
```

After:

```python
# Calculate the variance for each gene
expression_df["variance"] = expression_df.var(axis=1, skipna=True)

# Find the upper quartile for these data
upper_quartile = expression_df["variance"].quantile([0.90]).values

# Filter the data choosing only genes whose variances are in the upper quartile
df_by_var = expression_df[expression_df.variance > float(upper_quartile)]
```

Related readings:

- Stop Using Magic Numbers and Variables in Your Code by Aaberge (2021).

**Add checks**

Rationale: Just because your script ran without an error that stopped the script doesn't mean it is accurate and error free.
Silent errors are the trickiest to solve, because you often won't know that they happened! A very common error is data that is in the wrong order. In this example we have two data frames that contain information about the same samples, but in the original script we never check that the samples are in the same order in the metadata and the gene expression matrix! This is a really easy way to get incorrect results!

Before: Nothing, we didn't check for this before.

After:

```python
print(metadata["refinebio_accession_code"].tolist() == expression_df.columns.tolist())
```

Continue to try to apply the general advice we gave about code to your notebook! Then, when you are ready, take a look at what our "final" version looks like in the example Python repository. (Final here is in quotes because we may continue to make improvements to this notebook too – remember what we said about iterative?)

### R version of the exercise

**About the tidyverse**

Before we dive into the exercise, a word about the tidyverse. The tidyverse is a highly useful set of packages for creating readable and reproducible data science workflows in R. In general, we will opt for tidyverse approaches in this course, and strongly encourage you to familiarize yourself with the tidyverse if you have not. We will point out some instances where tidyverse functions can help you DRY up your code as well as make it more readable!

More reading on the tidyverse:

- Tidyverse Skills for Data Science by Carrie Wright (n.d.).
- A Beginner's Guide to Tidyverse by A. V. Team (2019).
- Introduction to tidyverse by Shapiro et al. (2021).

The exercise: Polishing code

- Open up RStudio.
- Open up the notebook you created in the previous chapter.

Now we'll work on applying the principles from this chapter to the code. We'll cover some of the points here, but then we encourage you to dig into the fully transformed notebook we will link at the end of this section.
Work on organizing the code chunks and adding documentation to reflect the steps we've laid out in the previous section. You may want to work on this iteratively as we dive into the code. As you clean up the code, you should run and re-run chunks to see if they work as you expect. You will also want to refresh your environment to help you develop the code (sometimes older objects stuck in your environment can inhibit your ability to troubleshoot). In RStudio, you refresh your environment by going to the Run menu and choosing Restart R and Clear Output.

**Set the seed**

Rationale: The clustering in the analysis involves some randomness. We need to set the seed!

Before: Nothing! We didn't set the seed before!

After: You can pick any number; it doesn't have to be 1234.

```r
set.seed(1234)
```

**Get rid of setwd()**

Rationale: `setwd()` almost never works for anyone besides the one person who wrote it. And in a few days/weeks it may not work for them either.

Before:

```r
setwd("Super specific/filepath/that/noone/else/has/")
```

After: Nothing! Now that we are working from a notebook, we know that the default current directory is wherever the notebook is placed (Xie, Dervieux, and Riederer 2020).

Related readings:

- Jenny Bryan will light your computer on fire if you use setwd() in a script (Bryan 2017).

**Give the variables more informative names**

Rationale: `xx` doesn't tell us what is in the data here. Also, by using `readr::read_tsv()` from the tidyverse we'll get a cleaner, faster read and won't have to specify the `sep` argument. Note we are also fixing some spacing and using `<-` so that we can stick to readability conventions.

Before:

```r
xx=read.csv("metadata_SRP070849.tsv", sep = "\t")
```

After:

```r
metadata <- readr::read_tsv("metadata_SRP070849.tsv")
```

Related readings:

- readr::read_tsv() documentation by "Read a Delimited File (Including CSV and TSV) into a Tibble — Read_delim" (n.d.).

**DRYing up data frame manipulations**

Rationale: This chunk of code can be very tricky to understand.
What is happening with df1 and df2? What's being filtered out? Code comments would certainly help understanding, but even better, we can DRY this code up and make the code clearer on its own.

Before: It may be difficult to tell from looking at the before code because there are no comments and it's a bit tricky to read, but the goal of this is to:

1) Calculate variances for each row (each row is a gene).
2) Filter the original gene expression matrix to only genes that have a bigger variance (here we arbitrarily use 10 as a filter cutoff).

```r
df=read.csv("SRP070849.tsv", sep="\t")
sums=matrix(nrow = nrow(df), ncol = ncol(df) - 1)
for(i in 1:nrow(sums)) {
  sums[i, ] <- sum(df[i, -1])
}
df2=df[which(df[, -1] >= 10), ]
variances=matrix(nrow = nrow(dds), ncol = ncol(dds) - 1)
for(i in 1:nrow(dds)) {
  variances[i, ] <- var(dds[i, -1])
}
```

After: Let's see how we can do this in a DRYer and clearer way. We can:

1) Add comments to describe our goals.
2) Use variable names that are more informative.
3) Use the apply functions to do the loop for us – this will eliminate the need for the unclear variable `i` as well.
4) Use the tidyverse to do the filtering for us so we don't have to rename data frames or store extra versions of `df`.

Here's what the above might look like after some refactoring. Hopefully you find this easier to follow, and in total there are fewer lines of code (but also comments too!).
```r
# Read in data TSV file
expression_df <- readr::read_tsv(data_file) %>%
  # Here we are going to store the gene IDs as row names so that
  # we can have only numeric values to perform calculations on later
  tibble::column_to_rownames("Gene")

# Calculate the variance for each gene
variances <- apply(expression_df, 1, var)

# Determine the upper quartile variance cutoff value
upper_var <- quantile(variances, 0.75)

# Filter the data choosing only genes whose variances are in the upper quartile
df_by_var <- data.frame(expression_df) %>%
  dplyr::filter(variances > upper_var)
```

**Add checks**

Rationale: Just because your script ran without an error that stopped the script doesn't mean it is accurate and error free. Silent errors are the trickiest to solve, because you often won't know that they happened! A very common error is data that is in the wrong order. In this example we have two data frames that contain information about the same samples, but in the original script we never check that the samples are in the same order in the metadata and the gene expression matrix! This is a really easy way to get incorrect results!

Before:

```r
# Nothing... we didn't check for this :(
```

After:

```r
# Make the data in the order of the metadata
expression_df <- expression_df %>%
  dplyr::select(metadata$refinebio_accession_code)

# Check if this is in the same order
all.equal(colnames(expression_df), metadata$refinebio_accession_code)
```

Continue to try to apply the general advice we gave about code to your notebook! Then, when you are ready, take a look at what our "final" version looks like in the example R repository. (Final here is in quotes because we may continue to make improvements to this notebook too – remember what we said about iterative?)

Now that we've made some nice updates to the code, we are ready to do a bit more polishing by adding more documentation! But before we head to the next chapter, we can style the code we wrote automatically by using automatic code stylers!
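One way to make the sample-order check above even stricter, in either language, is to assert rather than print, so a mismatch halts the analysis instead of relying on someone noticing a False in the output. A minimal Python sketch, with plain lists standing in for the metadata accession codes and expression matrix columns (the sample IDs here are hypothetical):

```python
# Hypothetical stand-ins for metadata sample IDs and expression matrix columns
metadata_samples = ["SRR001", "SRR002", "SRR003"]
expression_columns = ["SRR001", "SRR002", "SRR003"]

# An assert stops the script with an error if the order ever disagrees,
# instead of printing False and letting the analysis silently continue
assert metadata_samples == expression_columns, "Sample order mismatch!"
```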
## 7.6 Exercise 2: Style code automatically!

### Styling Python code automatically

Run your notebook through black. First you'll need to install it by running commands in a Terminal window in your JupyterLab. Make sure you are running this within your conda environment:

```
conda activate reproducible-python
```

Now install black:

```
pip install black[jupyter]
```

To record your conda environment, run this command:

```
conda env export > environment-record.yml
```

Now you can automatically style your code by running this command (be sure to replace `make-heatmap.ipynb` with whatever you have named your notebook):

```
python -m black make-heatmap.ipynb
```

You should get a message that your notebook was styled!

### Styling R code automatically

Let's run your notebook through styler. First you'll need to install it and add it to your renv:

```r
install.packages("styler")
```

Then add it to your renv by running:

```r
renv::snapshot()
```

Now you can automatically style your code by running this command from your Console (be sure to replace `"make-heatmap.Rmd"` with whatever you have named your notebook):

```r
styler::style_file("make-heatmap.Rmd")
```

You should get a message that your notebook was styled!

Before you are done with this exercise, there's one more thing we need to do: upload the latest version to GitHub. Follow these instructions to add the latest version of your notebook to your GitHub repository. Later, we will practice and discuss how to more fully utilize the features of GitHub, but for now, just drag and drop it as the linked instructions describe. Any feedback you have regarding this exercise is greatly appreciated; you can fill out this form!

# Chapter 8 Documenting analyses

## 8.1 Learning Objectives

## 8.2 Why documentation?
Documentation is an important but sometimes overlooked part of creating a reproducible analysis! There are two parts of documentation we will discuss here: 1) in-notebook descriptions and 2) READMEs.

Both notebook descriptions and READMEs are written in Markdown – a lightweight shorthand that renders to HTML (the same as the documentation parts of your code). If you aren't familiar, Markdown is such a handy tool and we encourage you to learn it (it doesn't take too long); here's a quick guide to get you started.

### 8.2.1 Notebook descriptions

As we discussed in chapter 5, data analyses can lead one on a winding trail of decisions, but notebooks allow you to narrate your thought process as you travel along these analysis explorations! Your scientific notebook should include descriptions that cover:

#### The purposes of the notebook

What scientific question are you trying to answer? Describe the dataset you are using to try to answer this, and why it helps answer the question.

#### The rationales behind your decisions

Describe why a particular code chunk is doing a particular thing – the more odd the code looks, the greater the need for you to describe why you are doing it. Describe any particular filters or cutoffs you are using and how you decided on those. For data wrangling steps, why are you wrangling the data in such a way – is this because a certain package you are using requires it?

#### Your observations of the results

What do you think about the results? The plots and tables you show in the notebook – how do they inform your original questions?

### 8.2.2 READMEs!

READMEs are also a great way to help your collaborators get quickly acquainted with the project.
READMEs stick out in a project and are a generally universal signal for people new to the project to start by READing them. GitHub automatically previews your file called README.md when someone comes to the main page of your repository, which further encourages people looking at your project to read the information in your README.

Information that should be included in a README:

- General purpose of the project
- Instructions on how to re-run the project
- Lists of any software required by the project
- Input and output file descriptions
- Descriptions of any additional tools included in the project

You can take a look at this template README to get you started.

#### More about writing READMEs:

- How to write a good README file
- A Beginners Guide to writing a Kicka** README
- How to write an awesome README

## 8.3 Get the exercise project files (or continue with the files you used in the previous chapter)

Get the Python project example files: click this link to download. Now double click your chapter zip file to unzip. (For Windows you may have to follow these instructions.)

Get the R project example files: click this link to download. Now double click your chapter zip file to unzip. (For Windows you may have to follow these instructions.)

## 8.4 Exercise 1: Practice beefing up your notebook descriptions

### Python project exercise

- Start up JupyterLab by running `jupyter lab` from your command line.
- Activate your conda environment using `conda activate reproducible-python`.
- Open up the notebook you've been working on in the previous chapters: `make_heatmap.ipynb`.
- Create a new chunk in your notebook and choose the "Markdown" option in the dropdown menu.
- Continue to add more descriptions where you feel they are necessary. You can reference the descriptions we have in the "final" version in the example Python repository.
(Again, final here is in quotes because we may continue to make improvements to this notebook too – remember what we said about iterative?)

### R project exercise

- Open up RStudio.
- Open up the notebook you've been working on in the previous chapters: `make_heatmap.Rmd`.
- In between code chunks, add more descriptions using Markdown. You can test how this renders by saving your .Rmd and then opening up the resulting nb.html file and choosing View in Browser.
- Continue to add more descriptions where you feel they are necessary. You can reference the descriptions we have in the "final" version in the example R repository. (Again, final here is in quotes because we may continue to make improvements to this notebook too – remember what we said about iterative?)

## 8.5 Exercise 2: Write a README for your project!

- Download this template README.
- Fill in the questions inside the { } to create a README for this project. You can reference the "final" versions of the README, but keep in mind it will reference items that we will discuss in the "advanced" portion of this course. See the R README here and the Python README here.
- Add your README and updated notebook to your GitHub repository. Follow these instructions to add the latest version of your notebook to your GitHub repository. Later, we will practice and discuss how to more fully utilize the features of GitHub, but for now, just drag and drop it as the linked instructions describe.

Any feedback you have regarding this exercise is greatly appreciated; you can fill out this form!

# Chapter 9 Code review

## 9.1 Learning Objectives

We've previously discussed that the only way to know if your analysis is truly reproducible is to send it to someone else to reproduce! That sentiment is at the heart of code review.
Although most of us wouldn't dare send out a manuscript for publication without having our collaborators give it a line-by-line review, people don't always feel the same way about code. Parker (2017) describes code review:

> Code review will not guarantee an accurate analysis, but it's one of the most reliable ways of establishing one that is more accurate than before.

Not only does code review help boost the accuracy and reproducibility of the analysis, it also helps everyone involved in the process learn something new!

#### Recommended reading about code review

- Code Review Guidelines for Humans by Hauer (2018).
- Your Code Sucks! – Code Review Best Practices by Hildebr (2020).
- Best practices for Code Review by S. Team (2021).
- Why code reviews matter (and actually save time!) by Radigan (2021).

## 9.2 Exercise: Set up your code review request!

Since reproducibility is all about someone else being able to run your code and obtain your results, the exercise in this course involves preparing to do just that!

The goal: In the second part of this reproducibility course we will discuss how to conduct formal line-by-line code reviews, but for now, we will discuss how to prep your analysis for someone else to look at your code and attempt to run it.

At this point, you should have a GitHub repository that contains the following:

- A make_heatmap notebook
- A README
- A data folder containing the metadata and gene expression matrix files in a folder named SRP070849:
  - SRP070849/metadata_SRP070849.tsv
  - SRP070849/SRP070849.tsv

1) Refresh and delete output

Before you send off your code to someone else, delete your output (the results and plots folders) and attempt to re-run it yourself. This also involves restarting your R session/Python kernel and running all the chunks again.

2) Re-run the whole analysis

3) Interrogate and troubleshoot

If your code has any issues running, try your best to troubleshoot the problems. Read this handy guide for tips on troubleshooting R.
4) Rinse and repeat

Repeat this as many times as needed until you are reliably able to re-run this code and get the same results without any code smells popping up. Dig into bad code smells or bad results smells wherever you sense them. If you aren't sure why you feel this way about your code or results, hold on to that feeling -- it may be something your collaborator will be able to see that you don't.

5) Let it simmer

Leave your analysis for a bit. Do you think it's perfect? Are you at your wits' end with it? No matter how you feel about it, let it sit for half a day or so, then return to it with fresh eyes (Savonen 2021b).

6) Re-review your documentation and code with fresh eyes

Now, with fresh eyes and doing your best to imagine you don't have the knowledge you have -- do your analysis and results make sense?

7) Are you sure it's ready?

Ask yourself if you've polished this code and documentation as far as you can reasonably take it. Realize that determining what qualifies as "as far as you can reasonably take it" is also a skill you will build with time. Code review is the most efficient use of everyone's time when your code and documentation have reached this point.

8) Draft your request

Now you are ready to send this code to your collaborator, but first try to send them a specific set of instructions and questions about what you would like them to review. In your message to them, include this information (you may want to draft this out in a scratch file).

Code review requests should include:

- A link to your repository that has your README, to get them quickly oriented to the project.
- A request for what kind of feedback you are looking for. Big picture? Technical? Method selection?
- Are there specific areas of the code you are having trouble with or are unsure about? Send a link to the specific lines in GitHub you are asking about.
- Are there results that are surprising, confusing, or smell wrong?
- Be sure to detail what you have dug into and tried at this point for any problematic points.
- Explicitly state what commands or tests you'd like them to run.
- Lastly, thank them for helping review your code!

9) Ready for review

Now you are ready to send your crafted message to your collaborator for review. For the purposes of this exercise, you may not want to ask your collaborator to spend their time carefully reviewing this practice repository, but now that you understand and have done the steps involved, you are prepared to do this for your own analyses!

TL;DR for asking for a code review:

Any feedback you have regarding this exercise is greatly appreciated; you can fill out this form!

In the second part of this course, we will discuss how to conduct code review through GitHub, further utilize version control, and more!

# About the Authors

These credits are based on our course contributors table guidelines.

| Credits | Names |
|---|---|
| Pedagogy | |
| Lead Content Instructor(s) | Candace Savonen |
| Lecturer(s) | Candace Savonen |
| Content Directors | Jeff Leek, Sarah Wheelan |
| Content Consultants | [David Swiderski] |
| Acknowledgments | [Patrick O'Connell] |
| Production | |
| Content Publisher | Ira Gooding |
| Content Publishing Reviewers | Ira Gooding |
| Technical | |
| Course Publishing Engineer | Candace Savonen |
| Template Publishing Engineers | Candace Savonen, Carrie Wright |
| Publishing Maintenance Engineer | Candace Savonen |
| Technical Publishing Stylists | Carrie Wright, Candace Savonen |
| Package Developers (ottrpal) | John Muschelli, Candace Savonen, Carrie Wright |
| Art and Design | |
| Illustrator | Candace Savonen |
| Figure Artist | Candace Savonen |
| Videographer | Candace Savonen |
| Videography Editor | Candace Savonen |
| Funding | |
| Funder | National Cancer Institute (NCI) UE5 CA254170 |
| Funding Staff | Emily Voeglein, Fallon Bachman |

```
## ─ Session info ───────────────────────────────────────────────────────────────
##  setting  value
##  version  R version 4.0.2 (2020-06-22)
##  os       Ubuntu 20.04.3 LTS
##  system   x86_64, linux-gnu
##  ui       X11
##  language (EN)
##  collate  en_US.UTF-8
##  ctype    en_US.UTF-8
##  tz       Etc/UTC
##  date     2022-10-13
##
## ─ Packages ───────────────────────────────────────────────────────────────────
##  package     * version    date       lib source
##  assertthat    0.2.1      2019-03-21 [1] RSPM (R 4.0.3)
##  bookdown      0.24       2022-02-15 [1] Github (rstudio/bookdown@88bc4ea)
##  callr         3.4.4      2020-09-07 [1] RSPM (R 4.0.2)
##  cli           2.0.2      2020-02-28 [1] RSPM (R 4.0.0)
##  crayon        1.3.4      2017-09-16 [1] RSPM (R 4.0.0)
##  desc          1.2.0      2018-05-01 [1] RSPM (R 4.0.3)
##  devtools      2.3.2      2020-09-18 [1] RSPM (R 4.0.3)
##  digest        0.6.25     2020-02-23 [1] RSPM (R 4.0.0)
##  ellipsis      0.3.1      2020-05-15 [1] RSPM (R 4.0.3)
##  evaluate      0.14       2019-05-28 [1] RSPM (R 4.0.3)
##  fansi         0.4.1      2020-01-08 [1] RSPM (R 4.0.0)
##  fs            1.5.0      2020-07-31 [1] RSPM (R 4.0.3)
##  glue          1.6.1      2022-01-22 [1] CRAN (R 4.0.2)
##  hms           0.5.3      2020-01-08 [1] RSPM (R 4.0.0)
##  htmltools     0.5.0      2020-06-16 [1] RSPM (R 4.0.1)
##  jquerylib     0.1.4      2021-04-26 [1] CRAN (R 4.0.2)
##  knitr         1.33       2022-02-15 [1] Github (yihui/knitr@a1052d1)
##  lifecycle     1.0.0      2021-02-15 [1] CRAN (R 4.0.2)
##  magrittr      2.0.2      2022-01-26 [1] CRAN (R 4.0.2)
##  memoise       1.1.0      2017-04-21 [1] RSPM (R 4.0.0)
##  ottrpal       0.1.2      2022-02-15 [1] Github (jhudsl/ottrpal@1018848)
##  pillar        1.4.6      2020-07-10 [1] RSPM (R 4.0.2)
##  pkgbuild      1.1.0      2020-07-13 [1] RSPM (R 4.0.2)
##  pkgconfig     2.0.3      2019-09-22 [1] RSPM (R 4.0.3)
##  pkgload       1.1.0      2020-05-29 [1] RSPM (R 4.0.3)
##  prettyunits   1.1.1      2020-01-24 [1] RSPM (R 4.0.3)
##  processx      3.4.4      2020-09-03 [1] RSPM (R 4.0.2)
##  ps            1.3.4      2020-08-11 [1] RSPM (R 4.0.2)
##  purrr         0.3.4      2020-04-17 [1] RSPM (R 4.0.3)
##  R6            2.4.1      2019-11-12 [1] RSPM (R 4.0.0)
##  readr         1.4.0      2020-10-05 [1] RSPM (R 4.0.2)
##  remotes       2.2.0      2020-07-21 [1] RSPM (R 4.0.3)
##  rlang         0.4.10     2022-02-15 [1] Github (r-lib/rlang@f0c9be5)
##  rmarkdown     2.10       2022-02-15 [1] Github (rstudio/rmarkdown@02d3c25)
##  rprojroot     2.0.2      2020-11-15 [1] CRAN (R 4.0.2)
##  sessioninfo   1.1.1      2018-11-05 [1] RSPM (R 4.0.3)
##  stringi       1.5.3      2020-09-09 [1] RSPM (R 4.0.3)
##  stringr       1.4.0      2019-02-10 [1] RSPM (R 4.0.3)
##  testthat      3.0.1      2022-02-15 [1] Github (R-lib/testthat@e99155a)
##  tibble        3.0.3      2020-07-10 [1] RSPM (R 4.0.2)
##  usethis       2.1.5.9000 2022-02-15 [1] Github (r-lib/usethis@57b109a)
##  vctrs         0.3.4      2020-08-29 [1] RSPM (R 4.0.2)
##  withr         2.3.0      2020-09-22 [1] RSPM (R 4.0.2)
##  xfun          0.26       2022-02-15 [1] Github (yihui/xfun@74c2a66)
##  yaml          2.2.1      2020-02-01 [1] RSPM (R 4.0.3)
##
## [1] /usr/local/lib/R/site-library
## [2] /usr/local/lib/R/library
```
# Intro to Reproducibility in Cancer Informatics

March, 2024

## About this Course

This course is part of a series of courses for the Informatics Technology for Cancer Research (ITCR) called the Informatics Technology for Cancer Research Education Resource. This material was created by the ITCR Training Network (ITN), which is a collaborative effort of researchers around the United States to support cancer informatics and data science training through resources, technology, and events. This initiative is funded by the following grant: National Cancer Institute (NCI) UE5 CA254170. Our courses feature tools developed by ITCR investigators and make it easier for principal investigators, scientists, and analysts to integrate cancer informatics into their workflows. Please see our website at www.itcrtraining.org for more information.

## 0.1 Available course formats

This course is available in multiple formats, which allows you to take it in the way that best suits your needs. You can take it for a certificate, either for free or for a fee.

- The material for this course can be viewed without a login requirement on this Bookdown website. This format might be most appropriate for you if you rely on screen-reader technology.
- This course can be taken for free certification through Leanpub.
- This course can be taken on Coursera for certification here (but it is not available for free on Coursera).
- Our courses are open source; you can find the source material for this course on GitHub.

# Chapter 1 Introduction

## 1.1 Target Audience

The course is intended for students in the biomedical sciences and researchers who use informatics tools in their research and have not had training in reproducibility tools and methods.
This course is written for individuals who:

- Have some familiarity with R or Python -- have written some scripts.
- Have not had formal training in computational methods.
- Have limited or no familiarity with GitHub, Docker, or package management tools.

## 1.2 Topics covered:

This is a two-part series:

## 1.3 Motivation

Cancer datasets are plentiful, complicated, and hold untold amounts of information regarding cancer biology. Cancer researchers are working to apply their expertise to the analysis of these vast amounts of data, but training opportunities to properly equip them in these efforts can be sparse. This includes training in reproducible data analysis methods.

Data analyses are generally not reproducible without direct contact with the original researchers and a substantial amount of time and effort (Beaulieu-Jones and Greene 2017). Reproducibility in cancer informatics (as with other fields) is still not monitored or incentivized, despite being fundamental to the scientific method. Despite the lack of incentive, many researchers strive for reproducibility in their own work but often lack the skills or training to do so effectively.

Equipping researchers with the skills to create reproducible data analyses increases the efficiency of everyone involved. Reproducible analyses are more likely to be understood, applied, and replicated by others. This helps expedite the scientific process by helping researchers avoid false positive dead ends. Open source clarity in reproducible methods also saves researchers' time so they don't have to reinvent the proverbial wheel for methods that everyone in the field is already performing.

## 1.4 Curriculum

This course introduces the concepts of reproducibility and replicability in the context of cancer informatics. It uses hands-on exercises to demonstrate in practical terms how to increase the reproducibility of data analyses.
The course also introduces tools relevant to reproducibility, including analysis notebooks, package managers, git, and GitHub. The course includes hands-on exercises for how to apply reproducible code concepts to your code. Individuals who take this course are encouraged to complete these activities as they follow along with the course material to help increase the reproducibility of their analyses.

Goal of this course: Equip learners with reproducibility skills they can apply to their existing analysis scripts and projects. This course opts for an "ease into it" approach. We attempt to give learners doable, incremental steps to increase the reproducibility of their analyses.

What is not the goal: This course is meant to introduce learners to reproducibility tools, but it does not necessarily represent the absolute end-all, be-all best practices for the use of these tools. In other words, this course gives a starting point with these tools, but not an ending point. The advanced version of this course is the next step toward incrementally "better practices".

## 1.5 How to use the course

This course is designed with busy professional learners in mind -- those who may have to pick up and put down the course when their schedule allows. Each exercise gives you the option to continue along with the example files as you've been editing them in each chapter, OR to download fresh chapter files that have been edited in accordance with the relevant part of the course. This way, if you decide to skip a chapter or find that the files you've been working on no longer make sense, you have a fresh starting point at each exercise.

# Chapter 2 Defining reproducibility
## 2.1 Learning Objectives

## 2.2 What is reproducibility

There's been a lot of discussion about what is included in the term reproducibility, and there is some discrepancy between fields. For the purposes of informatics and data analysis, a reproducible analysis is one that can be re-run by a different researcher such that the same result and conclusion are found. Reproducibility is related to repeatability and replicability, but it is worth taking time to differentiate these terms.

Perhaps you are like Ruby and have just found an interesting pattern through your data analysis! This has probably been the result of many months or years on your project, and it's worth celebrating! But before she considers these results a done deal, Ruby should test whether she is able to re-run her own analysis and get the same results again. This is known as repeatability.

Given that Ruby's analysis is repeatable, she may feel confident now to share her preliminary results with her colleague, Avi the Associate. Whether or not someone else will be able to take Ruby's code and data, re-run the analysis, and obtain the same results is known as reproducibility.

If Ruby's results are able to be reproduced by Avi, Avi may now collect new data and use Ruby's same analysis methods to analyze his data. Whether or not Avi's new data and results concur with Ruby's study's original inferences is known as replicability.

You may realize that these levels of research build on each other (like science is supposed to do). In this way, we can think of these in a hierarchy. Skipping any of these levels of research applicability can lead to unreliable results and conclusions. Science progresses when data and hypotheses are put through these levels thoroughly and sequentially. If results are not repeatable, they won't be reproducible or replicable.
Ideally, all analyses and results would be reproducible without too much time and effort spent; this would aid the efficiency of research getting to the next stages and questions. But unfortunately, in practice, reproducibility is not as commonplace as we would hope. Institutions and reward systems generally do not prioritize or even measure reproducibility standards in research, and training opportunities for reproducible techniques can be scarce. Reproducible research can often feel like an uphill battle that is made steeper by the lack of training opportunities. In this course, we hope to equip you with the tools you need to enhance the reproducibility of your analyses so this uphill battle is less steep.

## 2.3 Reproducibility in daily life

What does reproducibility mean in the daily life of a researcher? Let's say Ruby's results are repeatable in her own hands and she excitedly tells her associate, Avi, about her preliminary findings. Avi is very excited about these results as well as Ruby's methods! Avi is interested in Ruby's analysis methods and results, so Ruby sends Avi the code and data she used to obtain the results. Now, whether or not Avi is able to obtain the same exact results with this same data and same analysis code will indicate whether Ruby's analysis is reproducible.

Ruby may have spent a lot of time on her code and getting it to work on her computer, but whether it will successfully run on Avi's computer is another story. Often when researchers share their analysis code, it takes a substantial amount of effort on the part of the researcher who received the code to get it working, and this often cannot be done successfully without help from the original code author (Beaulieu-Jones and Greene 2017). Avi is encountering errors because Ruby's code was written with Ruby's computer and local setup in mind, and she didn't know how to make it more generally applicable.
Avi is spending a lot of time just trying to re-run Ruby's same analysis on her same data; he has yet to be able to try the code on any additional data (which will likely bring up even more errors). Avi is still struggling to work with Ruby's code and is confused about the goals and approaches the code is taking. After struggling with Ruby's code for an untold amount of time, Avi may decide it's time to email Ruby to get some clarity.

Now both Avi and Ruby are confused about why this analysis isn't nicely re-running for Avi. Their attempts to communicate about the code through email haven't helped them clarify anything. Multiple versions of the code may have been sent back and forth between them, and now things are taking a lot more time than either of them expected.

Perhaps at some point Avi is able to successfully run Ruby's code on Ruby's same data. But just because Avi didn't get any errors doesn't mean that the code ran exactly the same as it did for Ruby. A lack of errors also doesn't mean that either Ruby's or Avi's runs of the code ran with high accuracy or that the results can be trusted. Even a small difference in a decimal point may indicate a more fundamental difference in how the analysis was performed, and this could be due to differences in software versions, settings, or any number of items in their computing environments.

## 2.4 Reproducibility is worth the effort!

Perhaps you've found yourself in a situation like Ruby and Avi: struggling to re-run code that you thought for sure was working a minute ago. In the upcoming chapters, we will discuss how to bolster your projects' reproducibility. As you apply these reproducible techniques to your own projects, you may feel like it is taking more time to reach endpoints, but keep in mind that reproducible analyses and projects have higher upfront costs that will absolutely pay off in the long term.
Reproducibility in your analyses is not only a time saver for yourself, but also for your colleagues, your field, and your future self! You might not change a single character in your code but then return to it in a few days/months/years and find that it no longer runs! Reproducible code stands the test of time longer, making "future you" glad you spent the time to work on it. It's said that your closest collaborator is you from 6 months ago, but you don't reply to email (Broman 2016). Many a data scientist has referred to their frustration with their past selves:

> Dear past-Hadley: PLEASE COMMENT YOUR CODE BETTER. Love present-Hadley
>
> -- Hadley Wickham (@hadleywickham) April 7, 2016

The more you comment your code and make it clear and readable, the more your future self will thank you.

Reproducible code also saves your colleagues time! The more reproducible your code is, the less time all of your collaborators will need to spend troubleshooting it. The more people who use your code and need to try to fix it, the more time is wasted. This can add up to a lot of wasted researcher time and effort. Reproducible code saves everyone exponential amounts of time and effort! It will also motivate others to use and cite your code and analyses in the future!

## 2.5 Reproducibility exists on a continuum!

Incremental work on your analyses is good! You do not need to make your analyses perfect on the first try or even within a particular time frame. The first step in creating an analysis is to get it to work once! But the work does not end there. Furthermore, no analysis is or will ever be perfect, in that it will not be reproducible in every single context throughout time. Incrementally pushing our analyses toward the right of this continuum is the goal.
# Chapter 3 Organizing your project

## 3.1 Learning Objectives

Keeping your files organized is a skill that has a high long-term payoff. While you are in the thick of an analysis, you may underestimate how many files and terms you have floating around. But a short time later, you may return to your files and realize your organization was not as clear as you hoped.

Tayo (2019) discusses four particular reasons why it is important to organize your project:

- Organization increases productivity. If a project is well organized, with everything placed in one directory, it makes it easier to avoid wasting time searching for project files such as datasets, code, output files, and so on.
- A well-organized project helps you to keep and maintain a record of your ongoing and completed data science projects.
- Completed data science projects could be used for building future models. If you have to solve a similar problem in the future, you can use the same code with slight modifications.
- A well-organized project can easily be understood by other data science professionals when shared on platforms such as GitHub.

Organization is yet another aspect of reproducibility that saves you and your colleagues time!

## 3.2 Organizational strategies

There are a lot of ways to keep your files organized, and there's no "one size fits all" organizational solution (Shapiro et al. 2021). In this chapter, we will discuss some generalities, but as far as specifics go, we will point you to others who have written about what works for them, and advise that you use their ideas as inspiration to figure out a strategy that works for you and your team.
The most important aspects of your project organization scheme are that it:

- Is project-oriented (Bryan 2017).
- Follows consistent patterns (Shapiro et al. 2021).
- Makes it easy for you and others to find the files you need quickly (Shapiro et al. 2021).
- Minimizes the likelihood of errors (like accidentally writing over files) (Shapiro et al. 2021).
- Is something maintainable (Shapiro et al. 2021)!

### 3.2.1 Tips for organizing your project:

Getting more specific, here are some ideas for how to organize your project:

- Make file names informative to those who don't have knowledge of the project, but avoid using spaces, quotes, or unusual characters in your filenames and folders -- these only serve to make reading in files a nightmare in some programs.
- Number scripts in the order in which they are run.
- Keep like-files together in their own directory: results tables with other results tables, etc. Most importantly, keep raw data separate from processed data or other results!
- Put source scripts and functions in their own directory -- things that should never need to be called directly by yourself or anyone else.
- Put output in its own directories, like results and plots.
- Have a central document (like a README) that describes the basic information about the analysis and how to re-run it.
- Make it easy on yourself -- dates aren't necessary. The computer keeps track of those.
- Make a central script that re-runs everything, including the creation of the folders! (More on this in a later chapter.)

Let's see what these principles might look like put into practice.
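As a small taste of that central script, the folder-creation tip can be sketched in shell; the folder names here are illustrative:

```shell
#!/bin/bash
# Sketch: a central script can recreate the whole folder skeleton in one line
# (folder names are illustrative).
mkdir -p plots results raw-data processed-data util
```

Because `mkdir -p` does nothing if a folder already exists, this line is safe to keep at the top of a `run_analysis.sh`-style script and re-run every time.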
#### 3.2.1.1 Example organizational scheme

Here's an example of what this might look like:

```
project-name/
├── run_analysis.sh
├── 00-download-data.sh
├── 01-make-heatmap.Rmd
├── README.md
├── plots/
│   └── project-name-heatmap.png
├── results/
│   └── top_gene_results.tsv
├── raw-data/
│   ├── project-name-raw.tsv
│   └── project-name-metadata.tsv
├── processed-data/
│   └── project-name-quantile-normalized.tsv
└── util/
    ├── plotting-functions.R
    └── data-wrangling-functions.R
```

What these hypothetical files and folders contain:

- `run_analysis.sh` - A central script that runs everything again.
- `00-download-data.sh` - The script that needs to be run first and is called by `run_analysis.sh`.
- `01-make-heatmap.Rmd` - The script that needs to be run second and is also called by `run_analysis.sh`.
- `README.md` - The document that has the information that will orient someone to this project; we'll discuss how to create a helpful README in an upcoming chapter.
- `plots` - A folder of plots and resulting images.
- `results` - A folder of results.
- `raw-data` - Data files as they first arrive, before anything has been done to them.
- `processed-data` - Data that has been modified from the raw data in some way.
- `util` - A folder of utilities that never need to be called or touched directly unless troubleshooting something.

## 3.3 Readings about organizational strategies for data science projects:

But you don't have to take my organizational strategy; there are lots of ideas out there. You can read through some of these articles to think about what kind of organizational strategy might work for you and your team:

- Jenny Bryan's organizational strategies (Bryan and Hester 2021).
- Danielle Navarro's organizational strategies (Navarro 2021).
- Jenny Bryan on project-oriented workflows (Bryan 2017).
- Data Carpentry mini-course about organizing projects ("Project Organization and Management for Genomics" 2021).
- Andrew Severin's strategy for organization (Severin 2021).
- A BioStars thread where many individuals share their own organizational strategies ("How Do You Manage Your Files & Directories for Your Projects?" 2010).
- Data Carpentry course chapter about getting organized ("Introduction to the Command Line for Genomics" 2019).

## 3.4 Get the exercise project files (or continue with the files you used in the previous chapter)

Get the Python project example files: Click this link to download. Now double click your chapter zip file to unzip. For Windows you may have to follow these instructions.

Get the R project example files: Click this link to download. Now double click your chapter zip file to unzip. For Windows you may have to follow these instructions.

## 3.5 Exercise: Organize your project!

Using your computer's GUI (drag, drop, and clicking), organize the files that are part of this project:

- Organize these files using an organizational scheme similar to what is described above.
- Create folders like `plots`, `results`, and `data`. Note that `aggregated_metadata.json` and `LICENSE.TXT` also belong in the `data` folder.
- You will want to delete any files that say "OLD". Keeping multiple versions of your scripts around is a recipe for mistakes and confusion. In the advanced course we will discuss how to use version control to help you track changes more elegantly.

After your files are organized, you are ready to move on to the next chapter and create a notebook!

Any feedback you have regarding this exercise is greatly appreciated; you can fill out this form!

# Chapter 4 Making your project open source with GitHub

## 4.1 Learning Objectives

git is a version control system that is a great tool for creating reproducible analyses. What is version control?
Ruby here is experiencing a lack of version control and could probably benefit from using git. All of us at one point or another have created different versions of a file or document, but for analysis projects this can easily get out of hand if you don't have a system in place. That's where git comes in handy. There are other version control systems as well, but git is the most popular, in part because it works with GitHub, an online hosting service for git-controlled files.

### 4.1.1 GitHub and git allow you to…

#### 4.1.1.1 Maintain transparent analyses

Open and transparent analyses are a critical part of conducting open science. GitHub allows you to conduct your analyses in an open source manner. Open science also allows others to better understand your methods and potentially borrow them for their own research, saving everyone time!

#### 4.1.1.2 Have backups of your code and analyses at every point

Life happens; sometimes you misplace a file or your computer malfunctions. If you ever lose data on your computer or need to retrieve something from an earlier version of your code, GitHub allows you to revert your losses.

#### 4.1.1.3 Keep a documented history of your project

Over time in a project, a lot happens, especially when it comes to exploring and handling data. Sometimes the rationale behind decisions that were made around an analysis can get lost. GitHub keeps communications and tracks the changes to your files so that you don't have to revisit a question you already answered.

#### 4.1.1.4 Collaborate with others

Analysis projects benefit greatly from good collaborations! But having multiple copies of code on multiple collaborators' computers can be a nightmare to keep straight. GitHub allows people to work on the same set of code concurrently but still have a method to integrate all the edits together in a systematic way.
#### 4.1.1.5 Experiment with your analysis

Data science projects often lead to side analyses that could be very worthwhile but might be scary to venture into if you don't have your code well version controlled. Git and GitHub allow you to venture into these side experiments without fear, since your main code can be kept safe from your side venture.

## 4.2 Get the exercise project files (or continue with the files you used in the previous chapter)

Get the Python project example files: Click this link to download. Now double click your chapter zip file to unzip. For Windows you may have to follow these instructions.

Get the R project example files: Click this link to download. Now double click your chapter zip file to unzip. For Windows you may have to follow these instructions.

## 4.3 Exercise: Set up a project on GitHub

Go here for the video tutorial version of this exercise.

Now that we understand how useful GitHub is for creating reproducible analyses, it's time to set ourselves up on GitHub. Git and GitHub have a whole rich world of tools and terms that can get complex quickly, but for this exercise, we will not worry about those terms and functionalities just yet; we will focus on getting code up on GitHub so we are ready to collaborate and conduct open analyses!

1. Go to GitHub's main page and click Sign Up if you don't have an account.
2. Follow these instructions to create a repository. As a general, but not absolute, rule you will want to keep one GitHub repository for one analysis project.
3. Name the repository something that reminds you what it's related to.
4. Choose Public.
5. Check the box that says Add a README.
6. Follow these instructions to add the example files you downloaded to your new repository.

Congrats! You've started your very own project on GitHub! We encourage you to do the same with your own code and other projects!

Any feedback you have regarding this exercise is greatly appreciated; you can fill out this form!
# Chapter 5 Using Notebooks

## 5.1 Learning Objectives

Notebooks are a handy way to have the code, output, and scientist's thought process all documented in one place that is easy for others to read and follow. The notebook environment is incredibly useful for reproducible data science for a variety of reasons:

#### 5.1.0.1 Reason 1: Notebooks allow for tracking data exploration and encourage the scientist to narrate their thought process

> Each executed code cell is an attempt by the researcher to achieve something and to tease out some insight from the data set. The result is displayed immediately below the code commands, and the researcher can pause and think about the outcome. As code cells can be executed in any order, modified and re-executed as desired, deleted and copied, the notebook is a convenient environment to iteratively explore a complex problem. (Fangohr 2021)

#### 5.1.0.2 Reason 2: Notebooks allow for easy sharing of results

> Notebooks can be converted to html and pdf, and then shared as static read-only documents. This is useful to communicate and share a study with colleagues or managers. By adding sufficient explanation, the main story can be understood by the reader, even if they wouldn't be able to write the code that is embedded in the document. (Fangohr 2021)

#### 5.1.0.3 Reason 3: Notebooks can be re-run as a script or developed interactively

> A common pattern in science is that a computational recipe is iteratively developed in a notebook. Once this has been found and should be applied to further data sets (or other points in some parameter space), the notebook can be executed like a script, for example by submitting these scripts as batch jobs.
(Fangohr 2021) This can also be handy, especially if you use automation to enhance the reproducibility of your analyses (something we will talk about in the advanced part of this course). For all of these reasons, we encourage the use of computational notebooks as a means of enhancing reproducibility. (This course itself is also written with the use of notebooks!) 5.2 Get the exercise project files (or continue with the files you used in the previous chapter) Get the Python project example files Click this link to download. Now double click your chapter zip file to unzip. For Windows you may have to follow these instructions. Get the R project example files Click this link to download. Now double click your chapter zip file to unzip. For Windows you may have to follow these instructions. 5.3 Exercise: Convert code into a notebook! 5.3.1 Set up your IDE For this chapter, we will create notebooks from our example code files. Notebooks work best with the integrated development environment (IDE) they were created to work with. IDEs are sets of tools that help you develop your code. They are part “point and click” and part command line, and include lots of visuals that will help guide you. Set up a Python IDE Install JupyterLab We advise using the conda method to install JupyterLab; we will return to talk more about conda later on, so if you don’t have conda, you will need to install that first. We advise going with Anaconda instead of miniconda. To install Anaconda, download it from here. Download the installer, and follow the installation prompts. Start up Anaconda Navigator. On the home page choose JupyterLab and click Install. This may take a few minutes. Now you should be able to click Launch underneath JupyterLab. This will open up a page in your browser with JupyterLab.
Getting familiar with JupyterLab’s interface The JupyterLab interface consists of a main work area containing tabs of documents and activities, a collapsible left sidebar, and a menu bar. The left sidebar contains a file browser, the list of running kernels and terminals, the command palette, the notebook cell tools inspector, and the tabs list. The menu bar at the top of JupyterLab has top-level menus that expose actions available in JupyterLab with their keyboard shortcuts. The default menus are: File: actions related to files and directories Edit: actions related to editing documents and other activities View: actions that alter the appearance of JupyterLab Run: actions for running code in different activities such as notebooks and code consoles Kernel: actions for managing kernels, which are separate processes for running code Tabs: a list of the open documents and activities in the dock panel Settings: common settings and an advanced settings editor Help: a list of JupyterLab and kernel help links Set up an R IDE Install RStudio Install RStudio (and install R first if you have not already). After you’ve downloaded the RStudio installation file, double click on it and follow along with the installation prompts. Open up the RStudio application by double clicking on it. Getting familiar with RStudio’s interface The RStudio environment has four main panes, each of which may have a number of tabs that display different information or functionality. (Their specific locations can be changed under Tools -> Global Options -> Pane Layout.) The Editor pane is where you can write R scripts and other documents. Each tab here is its own document. This is your text editor, which will allow you to save your R code for future use. Note that code you change here will not run until you explicitly run it. The Console pane is where you can interactively run R code.
There is also a Terminal tab here which can be used for running programs outside R on your computer. The Environment pane primarily displays the variables, sometimes known as objects, that are defined during a given R session, and what data or values they might hold. The Help viewer pane has several tabs, all of which are pretty important: The Files tab shows the structure and contents of files and folders (also known as directories) on your computer. The Plots tab will reveal plots when you make them. The Packages tab shows which installed packages have been loaded into your R session. The Help tab will show the help page when you look up a function. The Viewer pane will reveal compiled R Markdown documents. From Shapiro et al. (2021) More reading about RStudio’s interface: RStudio IDE Cheatsheet (pdf). Navigating the RStudio Interface - R for Epidemiology 5.3.2 Create a notebook! Now, in your respective IDE, we’ll turn our unreproducible scripts into notebooks. In the next chapter we will begin to dive into the code itself, but for now, we’ll get the notebook ready to go. Set up a Python notebook Start a new notebook by going to New > Notebook. Then open up this chapter’s example code folder and open the make-heatmap.py file. Create a new code chunk in your notebook. Now copy and paste all of the code from make-heatmap.py into a new chunk. We will break this large chunk of code into smaller, thematic chunks in the next chapter. Save your Untitled.ipynb file as something that tells us what it will end up doing, like make-heatmap.ipynb. For more about using Jupyter notebooks see this by Mike (2021). Set up an R notebook Start a new notebook by going to File > New Files > R Notebook. Then open up this chapter’s example code folder and open the make_heatmap.R file. Practice creating a new chunk in your R notebook by clicking the Code > Insert Chunk button on the toolbar or by pressing Cmd+Option+I (on Mac) or Ctrl+Alt+I (on Windows). (You can also manually type out the back ticks and {}.) Delete all the default text in this notebook but keep the header, which is surrounded by --- and looks like: title: "R Notebook" output: html_notebook You can feel free to change the title from R Notebook to something that better suits the contents of this notebook. Now copy and paste all of the code from make_heatmap.R into a new chunk. We will break this large chunk of code into smaller, thematic chunks in the next chapter. Save your untitled.Rmd as something that tells us what it will end up doing, like make-heatmap.Rmd. Notice that upon saving your .Rmd file, a new .nb.html file of the same name is created. Open that file and choose view in Browser. If RStudio asks you to choose a browser, choose a default browser. This shows the nicely rendered version of your analysis and snapshots whatever output existed when the .Rmd file was saved. For more about using R notebooks see this by Xie, Allaire, and Grolemund (2018). Now that you’ve created your notebook, you are ready to start polishing that code! Any feedback you have regarding this exercise is greatly appreciated; you can fill out this form! Chapter 6 Managing package versions 6.1 Learning Objectives As we discussed previously, sometimes two different researchers can run the same code on the same data and get different results!
What Ruby and Avi may not realize is that although they may have used the same code and data, the software packages that they have on each of their computers might be very different. Even if they have the same software packages, they likely don’t have the same versions, and versions can influence results! Different computing environments are not only a headache to disentangle, they also can influence the reproducibility of your results (Beaulieu-Jones and Greene 2017). There are multiple ways to deal with variations in computing environments so that your analyses will be reproducible, and we will discuss a few different strategies for tackling this problem in this course and its follow-up course. But for now, we will start with the least intensive to implement: session info. There are two strategies for dealing with software versions that we will discuss in this chapter. Either of these strategies can be used alone or you can use both. They address different aspects of the computing environment discrepancy problem. 6.1.1 Strategy 1: Session Info - record a list of your packages One strategy to combat different software versions is to list the session info. This is the easiest (though not the most comprehensive) method for handling differences in software versions: have your code list details about your computing environment. Session info can lead to clues as to why results weren’t reproducible. For example, if both Avi and Ruby ran notebooks and included a session info print out, it may look like this: Session info shows us that they have different R versions and different operating systems. Both have the rmarkdown package attached, but with different rmarkdown package versions. If Avi and Ruby have discrepancies in their results, the session info print out gives a record that may hold clues to any discrepancies. This can give them items to look into for determining why the results didn’t reproduce as expected.
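To make the idea concrete, here is a minimal, standard-library-only sketch of a session info print-out in Python. (The exercises below use the session_info package, which reports far more detail; this sketch only illustrates the concept.)

```python
import platform
import sys

def print_session_info():
    """Print a bare-bones record of the computing environment."""
    print("Python:", sys.version.split()[0])
    print("Platform:", platform.platform())
    print("Machine:", platform.machine())

print_session_info()
```

Including a print-out like this at the end of every notebook gives collaborators a concrete record to compare when results differ.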
6.1.2 Strategy 2: Package managers - share a useable snapshot of your environment Package managers can help handle your computing environment for you in a way that you can share with others. In general, package managers work by capturing a snapshot of the environment, and when that environment snapshot is shared, the package manager attempts to rebuild it. For the R and Python versions of the exercises, we will be using different managers, but the foundational strategy will be the same: include a file that someone else could replicate your package set up from. For both exercises, we will download an environment ‘snapshot’ file we’ve set up for you, then we will practice adding a new package to the environments we’ve provided, and add them to your new repository along with the rest of your example project files. For Python, we’ll use conda for package management and store this information in an environment.yml file. For R, we’ll use renv for package management and store this information in a renv.lock file. 6.2 Get the exercise project files (or continue with the files you used in the previous chapter) Get the Python project example files Click this link to download. Now double click your chapter zip file to unzip. For Windows you may have to follow these instructions. Get the R project example files Click this link to download. Now double click your chapter zip file to unzip. For Windows you may have to follow these instructions. 6.3 Exercise 1: Print out session info Python version of the exercise In your scientific notebook, you’ll need to add two items. 1. Add import session_info to a code chunk at the beginning of your notebook. 2. Add session_info.show() to a new code chunk at the very end of your notebook. 3. Save your notebook as is. Note it will not run correctly until we address the issues with the code in the next chapter.
R version of the exercise In your Rmd file, add a chunk at the very end that looks like this:

```r
sessionInfo()
```

```
## R version 4.0.2 (2020-06-22)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.3 LTS
##
## Matrix products: default
## BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so
##
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=C
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base
##
## loaded via a namespace (and not attached):
##  [1] knitr_1.33      magrittr_2.0.2  hms_0.5.3       R6_2.4.1
##  [5] rlang_0.4.10    highr_0.8       stringr_1.4.0   httr_1.4.2
##  [9] tools_4.0.2     xfun_0.26       jquerylib_0.1.4 htmltools_0.5.0
## [13] ellipsis_0.3.1  ottrpal_0.1.2   yaml_2.2.1      digest_0.6.25
## [17] tibble_3.0.3    lifecycle_1.0.0 crayon_1.3.4    bookdown_0.24
## [21] readr_1.4.0     vctrs_0.3.4     fs_1.5.0        curl_4.3
## [25] evaluate_0.14   rmarkdown_2.10  stringi_1.5.3   compiler_4.0.2
## [29] pillar_1.4.6    pkgconfig_2.0.3
```

Save your notebook as is. Note it will not run correctly until we address the issues with the code in the next chapter. 6.4 Exercise 2: Package management Python version of the exercise Download this starter conda environment.yml file by clicking on the link and place it in your example project files directory. Navigate to your example project files directory using the command line. Create your conda environment from this file using the command: conda env create --file environment.yml Activate your conda environment using this command: conda activate reproducible-python Now start up JupyterLab again using this command: jupyter lab Follow these instructions to add the environment.yml file to the GitHub repository you created in the previous chapter.
Later we will practice and discuss how to more fully utilize the features of GitHub, but for now, just drag and drop it as the linked instructions describe. 6.4.1 More resources on how to use conda Install Jupyter using your own environment (Mac specific) Definitive guide to using conda R version of the exercise First install the renv package. Go to RStudio and the Console pane: Install renv using the following (you should only need to do this once per computer or RStudio environment): install.packages("renv") Now set up renv to use in your project. Change to your project’s directory using setwd() in your console window (don’t put this in a script or notebook). Use this command in your project: renv::init() This will start up renv in your particular project. (What’s :: about? In brief, it allows you to use a function from a package without loading the entire thing with library().) Now you can develop your project as you normally would, installing and removing packages in R as you see fit. For the purposes of this exercise, let’s install the styler package using the following command. (The styler package will come in handy for styling our code in the next chapter.) install.packages("styler") Now that we have installed styler we will want to add it to our renv snapshot. To add any packages we’ve installed to our renv snapshot we will use this command: renv::snapshot() This will save whatever packages we are currently using to our environment snapshot file called renv.lock. This renv.lock file is what we can share with our collaborators so they can replicate our computing environment. If your package installation attempts are unsuccessful and you’d like to revert to the previous state of your environment, you can run renv::restore(). This will revert your packages to the state recorded in renv.lock before you attempted to install styler or whatever packages you tried to install. You should see that an renv.lock file has now been created or updated!
You will always want to include this file with your project files. This means we will want to add it to our GitHub! Follow these instructions to add your renv.lock file to the GitHub repository you created in the previous chapter. Later we will practice and discuss how to more fully utilize the features of GitHub, but for now, just drag and drop it as the linked instructions describe. After you’ve added your computing environment files to your GitHub, you’re ready to continue using them with your IDE to actually work on the code in your notebook! Any feedback you have regarding this exercise is greatly appreciated; you can fill out this form! Chapter 7 Writing durable code 7.1 Learning Objectives 7.2 General principles 7.2.0.1 Work on your code iteratively Getting your code to work the first time is the first step, but don’t stop there! Just like in writing a manuscript you wouldn’t consider your first draft a final draft, polishing your code works best in an iterative manner. Although you may need to set it aside for the day to give your brain a rest, return to your code later with fresh eyes and try to look for ways to improve upon it! 7.2.0.2 Prioritize readability over cleverness Some cleverness in code can be helpful; too much can make it difficult for others (including your future self!) to understand. If cleverness compromises the readability of your code, it probably is not worth it. Clever but unreadable code won’t be re-used or trusted by others (AGAIN, including your future self!). What does readable code look like?
Orosz (2019) has some thoughts on writing readable code: Readable code starts with code that you find easy to read. When you finish coding, take a break to clear your mind. Then try to re-read the code, putting yourself in the mindset that you know nothing about the changes and why you made them. Can you follow along with your code? Do the variables and method names help understand what they do? Are there comments at places where just the code is not enough? Is the style of the code consistent across the changes? Think about how you could make the code more readable. Perhaps you see some functions that do too many things and are too long. Perhaps you find that renaming a variable would make its purpose clearer. Make changes until you feel like the code is as expressive, concise, and pretty as it can be. The real test of readable code is others reading it. So get feedback from others, via code reviews. Ask people to share feedback on how clear the code is. Encourage people to ask questions if something does not make sense. Code reviews - especially thorough code reviews - are the best way to get feedback on how good and readable your code is. Readable code will attract little to no clarifying questions, and reviewers won’t misunderstand it. So pay careful attention to the cases when you realize someone misunderstood the intent of what you wrote or asked a clarifying question. Every question or misunderstanding hints to opportunities to make the code more readable. A good way to get more feedback on the clarity of your code is to ask for feedback from someone who is not an expert on the codebase you are working on. Ask specifically for feedback on how easy to read your code is. Because this developer is not an expert on the codebase, they’ll focus on how much they can follow your code. Most of the comments they make will be about your code’s readability. We’ll talk a bit more about code review in an upcoming chapter! More reading: Readable Code by Orosz (2019). 
Write clean R code by Dubel (2021). Python Clean Code: 6 Best Practices to Make Your Python Functions More Readable by Tran (2021). 7.2.0.3 DRY up your code DRY is an acronym: “Don’t repeat yourself” (Smith 2013). “I hate code, and I want as little of it as possible in our product.” – Diederich (2012) If you find yourself writing something more than once, you might want to write a function, or store something as a variable. The added benefit of writing a function is you might be able to borrow it in another project. DRY code is easier to fix and maintain because if it breaks, it’s easier to fix something in one place than in 10 places. DRY code is easier on the reviewer because they don’t have to review the same thing twice, but also because they don’t have to review the same thing twice. ;) DRYing code is something that takes some iterative passes and edits through your code, but in the end DRY code saves you and your collaborators time and can be something you reuse again in a future project! Here’s a slightly modified example from Bernardo (2021) of what DRY vs non-DRY code might look like: paste('Hello','John', 'welcome to this course') paste('Hello','Susan', 'welcome to this course') paste('Hello','Matt', 'welcome to this course') paste('Hello','Anne', 'welcome to this course') paste('Hello','Joe', 'welcome to this course') paste('Hello','Tyson', 'welcome to this course') paste('Hello','Julia', 'welcome to this course') paste('Hello','Cathy', 'welcome to this course') This could be turned into a function and rewritten as: GreetStudent <- function(name) { greeting <- paste('Hello', name, 'welcome to this course') return(greeting) } class_names <- c('John', 'Susan', 'Matt', 'Anne', 'Joe', 'Tyson', 'Julia', 'Cathy') lapply(class_names, GreetStudent) Now, if you wanted to edit the greeting, you’d only need to edit it in the function, instead of in each instance. More reading about this idea: DRY Programming Practices by Klinefelter (2016).
Keeping R Code DRY with functions by Riffomonas Project (2021). Write efficient R code for science by Max Joseph (2017). Write efficient Python code by Leah Wasser (2019). Don’t repeat yourself: Python functions by Héroux (2018). 7.2.0.4 Don’t be afraid to delete and refresh a lot Don’t be afraid to delete it all and re-run (multiple times). This includes refreshing your kernel/session in your IDE. In essence, this is the data science version of “Have you tried turning it off and then on again?” Some bugs in your code exist or go unnoticed because old objects and libraries have overstayed their welcome in your environment. Why do you need to refresh your kernel/session? As a quick example of why refreshing your kernel/session matters, let’s suppose you are troubleshooting something that centers around an object named some_obj but then you rename this object to iris_df. When you rename this object you may need to update it in other places in the code. If you don’t refresh your environment while working on your code, some_obj will still be in your environment. This will make it more difficult for you to find where else the code needs to be updated. Refreshing your kernel/session goes beyond objects defined in your environment; it can also affect loaded packages and dependencies and all kinds of other things attached to your kernel/session. As a quick experiment, try this in your Python or R environment: The dir() and ls() functions list your defined variables in your Python and R environments, respectively. In Python: some_obj = [] dir() Now refresh your kernel and re-run dir(): dir() You should see you no longer have some_obj listed as being defined in your environment. In R: some_obj <- c() ls() Now refresh your session and re-run ls(): ls() You should see you no longer have some_obj listed as being defined in your environment. Keeping around old code and objects is generally more of a hindrance than a time saver.
Sometimes it can be easy to get very attached to a chunk of code that took you a long time to troubleshoot but there are three reasons you don’t need to stress about deleting it: You might write better code on the second try (or third or n’th). Keeping around old code makes it harder for you to write and troubleshoot new better code – it’s easier to confuse yourself. Sometimes a fresh start can be what you need. With version control you can always return to that old code! (We’ll dive more into version control later on, but you’ve started the process by uploading your code to GitHub in chapter 4!) This means you should not comment out old code. Just delete it! No code is so precious that you need to keep it commented out (particularly if you are using version control and you can retrieve it in other ways should you need it). Related to this, if you want to be certain that your code is reproducible, it’s worth deleting all your output, and re-running everything with a fresh session. The first step to knowing if your analysis is reproducible is seeing if you can repeat it yourself! 7.2.0.5 Use code comments effectively Good code comments are a part of writing good, readable code! Your code is more likely to stand the test of time for longer if others, including yourself in the future, can see what’s happening enough to trust it themselves. This will encourage others to use your code and help you maintain it! ‘Current You’ who is writing your code may know what is happening but ‘Future You’ will have no idea what ‘Current You’ was thinking (Spielman, n.d.): ‘Future You’ comes into existence about one second after you write code, and has no idea what on earth Past You was thinking. Help out ‘Future You’ by adding lots of comments! ‘Future You’ next week thinks Today You is an idiot, and the only way you can convince ‘Future You’ that Today You is reasonably competent is by adding comments in your code explaining why Today You is actually not so bad. 
Your code and your understanding of it will fade soon after you write it, leaving your hard work to fall into disrepair. Code that works is a start, but readable AND working code is best! Comments can help clarify at points where your code might need further explanation. The best code comments explain the why of what you are doing. The act of writing them can also help you think out your thought process and perhaps identify a better solution to the odd parts of your code. (From Savonen (2021a)) More reading: Creating clarifying code comments Best Practices for Writing Code Comments by Spertus (2021). What Makes a Good Code Comment? by Cronin (2019). The Value of Code Documentation by Meza (2018). Some internet wisdom on R documentation by Frazee (2014). How to Comment Your Code Like a Pro: Best Practices and Good Habits by Keeton (2019). 7.2.0.6 Use informative variable names Try to avoid using variable names that have no meaning, like tmp, x, or i. Meaningful variable names make your code more readable! Additionally, variable names that are longer than one letter are much easier to search and replace if needed. One-letter variables are hard to replace and hard to read. Don’t be afraid of long variable names; they are very unlikely to be confused! 1) Write intention-revealing names. 2) Use consistent notation for your naming convention. 3) Use standard terms. 4) Do not number a variable name. 5) When you find a better way to name a variable, refactor as fast as possible. (Hobert 2018) More reading: R for Epidemiology - Coding best practices by Cannell (2021). Data Scientists: Your Variable Names Are Awful. Here’s How to Fix Them by Koehrsen (2019). Writing Variable — Informative, Descriptive & Elegant by Hobert (2018). 7.2.0.7 Follow a code style Just like writing that doesN”t FoLLOW conv3nTi0Ns OR_sPAcinng 0r sp3llinG can be distracting, the same goes for code.
Your code may even work all the same, just like you understood what I wrote in that last sentence, but a lack of consistent style can require more brain power from your readers to understand it. For reproducibility purposes, readability is important! The easier you can make it on your readers, the more likely they will be able to understand and reproduce the results. There are different style guides out there that people adhere to. It doesn’t matter so much which one you choose as that you pick one and stick to it for a particular project. Python style guides: PEP8 style guide “PEP 8 – Style Guide for Python Code” (2021). Google Python style guide “Styleguide” (2021). R style guides: Hadley Wickham’s Style guide Wickham (2019). Google R style guide “Google’s R Style Guide” (2021). Although writing code following a style as you write is a good practice, we’re all human and that can be tricky to do, so we recommend using an automatic styler to fix up your code for you. For Python code, you can use Black, and for R, styler. 7.2.0.8 Organize the structure of your code Readable code should follow an organized structure. Just like how outlines help the structure of manuscript writing, outlines can also help the organization of code writing. A tentative outline for a notebook might look like this: A description of the purpose of the code (in Markdown). Import the libraries you will need (including sourcing any custom functions). List any hard-coded variables. Import data. Do any data cleaning needed. The main thing you need to do. Print out session info. Note that if your notebook gets too long, you may want to separate out some things into their own scripts. Additionally, it’s good practice to keep custom functions in their own file and import them. This allows you to use them elsewhere and also keeps the main part of the analysis cleaner.
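The outline above could be sketched as a minimal notebook/script skeleton in Python. (The file and variable names here are hypothetical placeholders, not the course's actual solution.)

```python
# 1) Purpose: a Markdown cell (or this comment) would describe the analysis.

# 2) Imports, including any custom functions kept in their own file
from pathlib import Path

# 3) Hard-coded variables, declared once near the top
project_id = "SRP070849"
data_dir = Path("data") / project_id

# 4-6) Import data, clean it, then do the main thing, each as its own step
def run_analysis():
    data_file = data_dir / f"{project_id}.tsv"
    # ... read the data, clean it, make and save the heatmap ...
    return data_file

print(run_analysis())

# 7) Session info would be printed at the very end
```

Keeping the hard-coded variables in one place up top means a new dataset only requires changing `project_id`.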
7.2.0.9 Set the seed if your analysis has randomness involved If any randomness is involved in your analysis, you will want to set the seed in order for your results to be reproducible. In brief, computers don’t actually create numbers randomly; they create numbers pseudorandomly. If you want your results to be reproducible, you should give your computer a seed by which to create random numbers. This will allow anyone who re-runs your analysis to have a positive control and eliminate randomness as a reason the results were not reproducible. For more on how setting the seed works – a quick experiment To illustrate how seeds work, we’ll run a quick experiment with setting the seed here: First let’s set a seed (it doesn’t matter what number we use, just that we pick a number), so let’s use 1234, and then create a “random” number. # Set the seed: set.seed(1234) # Now create a random number runif(1) ## [1] 0.1137034 Now if we try a different seed, we will get a different “random” number. # Set a different seed: set.seed(4321) # Now create a random number again runif(1) ## [1] 0.334778 But, if we return to the original seed we used, 1234, we will get the original “random” number we got. # Set this back to the original seed set.seed(1234) # Now we'll get the same "random" number we got when we set the seed to 1234 previously runif(1) ## [1] 0.1137034 More reading: Set seed by Soage (2020). Generating random numbers by Chang (2021). 7.2.0.10 To review general principles: 7.3 More reading on best coding practices There are so many opinions and strategies on best practices for code. And although a lot of these principles are generally applicable, not all of them are one size fits all. Some code practices are context-specific, so sometimes you may need to pick and choose what works for you, your team, and your particular project. 7.3.0.1 Python specific: Reproducible Programming for Biologists Who Code Part 2: Should Dos by Heil (2020).
15 common coding mistakes data scientist make in Python (and how to fix them) by Csendes (2020). Data Science in Production — Advanced Python Best Practices by Kostyuk (2020). 6 Mistakes Every Python Beginner Should Avoid While Coding by Saxena (2021). 7.3.0.2 R specific: Data Carpentry’s: Best Practices for Writing R Code by “Best Practices for Writing R Code – Programming with R” (2021). R Programming for Research: Reproducible Research by Good (2021). R for Epidemiology: Coding best practices by Cannell (2021). Best practices for R Programming by Bernardo (2021). 7.4 Get the exercise project files (or continue with the files you used in the previous chapter) Get the Python project example files Click this link to download. Now double click your chapter zip file to unzip. For Windows you may have to follow these instructions. Get the R project example files Click this link to download. Now double click your chapter zip file to unzip. For Windows you may have to follow these instructions. 7.5 Exercise 1: Make code more durable! 7.5.1 Organize the big picture of the code Before diving in line by line, it can be helpful to make a code outline of sorts. What are the main steps you need to accomplish in this notebook? What are the starting and ending points for this particular notebook? For example, for this make-heatmap notebook we want to: Set up analysis folders and declare file names. Install the libraries we need. Import the gene expression data and metadata. Filter down the gene expression data to genes of interest – in this instance the most variant ones. Clean the metadata. Create an annotated heatmap. Save the heatmap to a PNG. Print out the session info! Python version of the exercise The exercise: Polishing code Start up JupyterLab by running jupyter lab from your command line. Activate your conda environment using conda activate reproducible-python.
3) Open up the notebook you made in the previous chapter, `make-heatmap.ipynb`.
4) Work on organizing the code chunks and adding documentation to reflect the steps we've laid out in the previous section. You may want to work on this iteratively as we dive into the code.

As you clean up the code, you should run and re-run chunks to see if they work as you expect. You will also want to refresh your environment to help you develop the code (sometimes older objects stuck in your environment can inhibit your ability to troubleshoot). In Jupyter, you refresh your environment by using the refresh icon in the toolbar or by going to Restart Kernel.

**Set the seed**

Rationale: The clustering in the analysis involves some randomness. We need to set the seed!

Before: Nothing! We didn't set the seed before!

After: You can pick any number; it doesn't have to be 1234.

```python
random.seed(1234)
```

**Use a relative file path**

Rationale: Absolute file paths only work for the original writer of the code and no one else. But if we make the file path relative to the project setup, then it will work for whoever has the project repository (Mustafeez 2021). Additionally, we can set up our file path names using f-strings so that we only need to change the project ID and the rest will be ready for a new dataset (Python 2021)! Although this requires more lines of code, this setup is much more flexible and ready for others to use.
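Tying back to the seed experiment shown earlier in R, the same behavior can be checked with Python's built-in `random` module. A quick sketch (the seed values here are arbitrary):

```python
import random

# Setting a seed makes the "random" draw reproducible
random.seed(1234)
first = random.random()

# A different seed gives a different draw
random.seed(4321)
second = random.random()

# Returning to the original seed reproduces the original draw
random.seed(1234)
third = random.random()

print(first == third)   # the two draws from seed 1234 match
print(first == second)  # the draws from different seeds do not
```

The same idea carries over to library-specific generators (e.g. NumPy keeps its own random state), so set the seed for whichever generator your analysis actually uses.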
Before:

```python
df1=pd.read_csv('~/a/file/path/only/I/have/SRP070849.tsv', sep='\t')
mdf=pd.read_csv('~/a/file/path/only/I/have/SRP070849_metadata.tsv', sep='\t')
```

After:

```python
# Declare project ID
id = "SRP070849"

# Define the file path to the data directory
data_dir = Path(f"data/{id}")

# Declare the file path to the gene expression matrix file
data_file = data_dir.joinpath(f"{id}.tsv")

# Declare the file path to the metadata file
# inside the directory saved as `data_dir`
metadata_file = data_dir.joinpath(f"metadata_{id}.tsv")

# Read in metadata TSV file
metadata = pd.read_csv(metadata_file, sep="\t")

# Read in data TSV file
expression_df = pd.read_csv(data_file, sep="\t")
```

Related readings:

- f-strings in Python by Geeks (2018).
- f-Strings: A New and Improved Way to Format Strings in Python by Python (2021).
- Relative vs absolute file paths by Mustafeez (2021).
- About join path by "Python Examples of Pathlib.Path.joinpath" (2021).

**Avoid using mystery numbers**

Rationale: Avoid using numbers that don't have context around them in the code. Include the calculations for the number, or if it needs to be hard-coded, explain the rationale for that number in the comments. Additionally, using variable and column names that tell you what is happening helps clarify what the number represents.

Before:

```python
df1['calc'] =df1.var(axis = 1, skipna = True)
df2=df1[df1.calc >float(10)]
```

After:

```python
# Calculate the variance for each gene
expression_df["variance"] = expression_df.var(axis=1, skipna=True)

# Find the value of the 90th percentile of variance for these data
upper_quartile = expression_df["variance"].quantile([0.90]).values

# Filter the data choosing only genes whose variances are above the 90th percentile
df_by_var = expression_df[expression_df.variance > float(upper_quartile)]
```

Related readings:

- Stop Using Magic Numbers and Variables in Your Code by Aaberge (2021).

**Add checks**

Rationale: Just because your script ran without an error that stopped the script doesn't mean it is accurate and error free.
Silent errors are the trickiest to solve, because you often won't know that they happened! A very common error is data that are in the wrong order. In this example we have two data frames that contain information about the same samples. But in the original script, we don't ever check that the samples are in the same order in the metadata and the gene expression matrix! This is a really easy way to get incorrect results!

Before: Nothing, we didn't check for this before.

After:

```python
print(metadata["refinebio_accession_code"].tolist() == expression_df.columns.tolist())
```

Continue to try to apply the general advice we gave about code to your notebook! Then, when you are ready, take a look at what our "final" version looks like in the example Python repository. (Final here is in quotes because we may continue to make improvements to this notebook too -- remember what we said about iterative?)

**R version of the exercise**

About the tidyverse: Before we dive into the exercise, a word about the tidyverse. The tidyverse is a highly useful set of packages for creating readable and reproducible data science workflows in R. In general, we will opt for tidyverse approaches in this course, and strongly encourage you to familiarize yourself with the tidyverse if you have not. We will point out some instances where tidyverse functions can help you DRY up your code as well as make it more readable!

More reading on the tidyverse:

- Tidyverse Skills for Data Science by Carrie Wright (n.d.).
- A Beginner's Guide to Tidyverse by A. V. Team (2019).
- Introduction to tidyverse by Shapiro et al. (2021).

The exercise: Polishing code

1) Open up RStudio.
2) Open up the notebook you created in the previous chapter.

Now we'll work on applying the principles from this chapter to the code. We'll cover some of the points here, but then we encourage you to dig into the fully transformed notebook we will link at the end of this section.
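Since a printed `True` is easy to overlook, the sample-order check discussed above can also be made to fail loudly. A minimal stdlib sketch, using plain lists of hypothetical sample IDs in place of the real pandas objects (the helper name `check_sample_order` is ours, not from the notebook):

```python
# Hypothetical sample IDs; in the real notebook these would come from
# metadata["refinebio_accession_code"] and expression_df.columns
metadata_samples = ["SRR001", "SRR002", "SRR003"]
expression_samples = ["SRR001", "SRR002", "SRR003"]

def check_sample_order(meta_ids, expr_ids):
    """Raise an error, rather than failing silently, when sample orders differ."""
    if list(meta_ids) != list(expr_ids):
        raise ValueError("Metadata and expression matrix sample orders differ!")

check_sample_order(metadata_samples, expression_samples)  # passes quietly when orders match
```

An `assert` statement would work similarly; the advantage of an explicit exception is that the check still runs even when Python is invoked with assertions disabled.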
3) Work on organizing the code chunks and adding documentation to reflect the steps we've laid out in the previous section. You may want to work on this iteratively as we dive into the code.

As you clean up the code, you should run and re-run chunks to see if they work as you expect. You will also want to refresh your environment to help you develop the code (sometimes older objects stuck in your environment can inhibit your ability to troubleshoot). In RStudio, you refresh your environment by going to the Run menu and using Restart R and Clear Output.

**Set the seed**

Rationale: The clustering in the analysis involves some randomness. We need to set the seed!

Before: Nothing! We didn't set the seed before!

After: You can pick any number; it doesn't have to be 1234.

```r
set.seed(1234)
```

**Get rid of setwd**

Rationale: `setwd()` almost never works for anyone besides the one person who wrote it. And in a few days/weeks it may not work for them either.

Before:

```r
setwd("Super specific/filepath/that/noone/else/has/")
```

After: Now that we are working from a notebook, we know that the default current directory is wherever the notebook is placed (Xie, Dervieux, and Riederer 2020).

Related readings:

- Jenny Bryan will light your computer on fire if you use setwd() in a script (Bryan 2017).

**Give the variables more informative names**

Rationale: `xx` doesn't tell us what is in the data here. Also, by using `readr::read_tsv()` from the tidyverse, we'll get a cleaner, faster read and won't have to specify the `sep` argument. Note we are also fixing some spacing and using `<-` so that we can stick to readability conventions.

Before:

```r
xx=read.csv("metadata_SRP070849.tsv", sep = "\t")
```

After:

```r
metadata <- readr::read_tsv("metadata_SRP070849.tsv")
```

Related readings:

- readr::read_tsv() documentation by "Read a Delimited File (Including CSV and TSV) into a Tibble — Read_delim" (n.d.).

**DRYing up data frame manipulations**

Rationale: It can be very tricky to understand what this chunk of code is doing.
What is happening with `df1` and `df2`? What's being filtered out? Code comments would certainly help understanding, but even better, we can DRY this code up and make the code clearer on its own.

Before: It may be difficult to tell from looking at the before code because there are no comments and it's a bit tricky to read, but the goal of this is to:

1) Calculate variances for each row (each row is a gene).
2) Filter the original gene expression matrix to only genes that have a larger variance (here we arbitrarily use 10 as a filter cutoff).

```r
df=read.csv("SRP070849.tsv", sep="\t")
sums=matrix(nrow = nrow(df), ncol = ncol(df) - 1)
for(i in 1:nrow(sums)) {
  sums[i, ] <- sum(df[i, -1])
}
df2=df[which(df[, -1] >= 10), ]
variances=matrix(nrow = nrow(dds), ncol = ncol(dds) - 1)
for(i in 1:nrow(dds)) {
  variances[i, ] <- var(dds[i, -1])
}
```

After: Let's see how we can do this in a DRYer and clearer way. We can:

1) Add comments to describe our goals.
2) Use variable names that are more informative.
3) Use the apply functions to do the loop for us -- this will eliminate the need for the unclear variable `i` as well.
4) Use the tidyverse to do the filtering for us so we don't have to rename data frames or store extra versions of `df`.

Here's what the above might look like after some refactoring. Hopefully you find this easier to follow, and in total there are fewer lines of code (but with comments too!).
```r
# Read in data TSV file
expression_df <- readr::read_tsv(data_file) %>%
  # Here we are going to store the gene IDs as row names so that
  # we can have only numeric values to perform calculations on later
  tibble::column_to_rownames("Gene")

# Calculate the variance for each gene
variances <- apply(expression_df, 1, var)

# Determine the upper quartile variance cutoff value
upper_var <- quantile(variances, 0.75)

# Filter the data choosing only genes whose variances are in the upper quartile
df_by_var <- data.frame(expression_df) %>%
  dplyr::filter(variances > upper_var)
```

**Add checks**

Rationale: Just because your script ran without an error that stopped the script doesn't mean it is accurate and error free. Silent errors are the trickiest to solve, because you often won't know that they happened! A very common error is data that are in the wrong order. In this example we have two data frames that contain information about the same samples. But in the original script, we don't ever check that the samples are in the same order in the metadata and the gene expression matrix! This is a really easy way to get incorrect results!

Before: Nothing... we didn't check for this :(

After:

```r
# Make the data in the order of the metadata
expression_df <- expression_df %>%
  dplyr::select(metadata$refinebio_accession_code)

# Check if this is in the same order
all.equal(colnames(expression_df), metadata$refinebio_accession_code)
```

Continue to try to apply the general advice we gave about code to your notebook! Then, when you are ready, take a look at what our "final" version looks like in the example R repository. (Final here is in quotes because we may continue to make improvements to this notebook too -- remember what we said about iterative?)

Now that we've made some nice updates to the code, we are ready to do a bit more polishing by adding more documentation! But before we head to the next chapter, we can style the code we wrote automatically by using automatic code stylers!
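The `quantile(variances, 0.75)` cutoff used in the R refactor has a stdlib analogue in Python as well. A hedged sketch with made-up variance values, purely to show the mechanics (`statistics.quantiles` with `n=4` returns the three quartile cut points):

```python
import statistics

# Made-up per-gene variances, purely for illustration
variances = [0.5, 1.2, 3.4, 9.9, 2.2, 7.7, 0.1, 5.5]

# The last of the three quartile cut points is the upper (75th percentile) cutoff
upper_var = statistics.quantiles(variances, n=4)[-1]

# Keep only the "genes" whose variance exceeds the cutoff,
# mirroring the dplyr::filter(variances > upper_var) step above
high_var = [v for v in variances if v > upper_var]
print(len(high_var), "of", len(variances), "values fall above the upper quartile")
```

Note that quantile estimators differ slightly between implementations (R's `quantile()` and Python's `statistics.quantiles` use different interpolation defaults), which is exactly the kind of hard-coded-number detail worth explaining in a comment.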
## 7.6 Exercise 2: Style code automatically!

**Styling Python code automatically**

Run your notebook through black. First you'll need to install it by running this command in a Terminal window in your JupyterLab. Make sure you are running this within your conda environment:

```
conda activate reproducible-python
```

Now install black:

```
pip install black[jupyter]
```

To record your conda environment, run this command:

```
conda env export > environment-record.yml
```

Now you can automatically style your code by running this command from your Terminal (be sure to replace `make-heatmap.ipynb` with whatever you have named your notebook):

```
python -m black make-heatmap.ipynb
```

You should get a message that your notebook was styled!

**Styling R code automatically**

Let's run your notebook through styler. First you'll need to install it:

```r
install.packages("styler")
```

Then add it to your renv by running:

```r
renv::snapshot()
```

Now you can automatically style your code by running this command from your Console (be sure to replace `"make-heatmap.Rmd"` with whatever you have named your notebook):

```r
styler::style_file("make-heatmap.Rmd")
```

You should get a message that your notebook was styled!

Before you are done with this exercise, there's one more thing we need to do: upload the latest version to GitHub. Follow these instructions to add the latest version of your notebook to your GitHub repository. Later, we will practice and discuss how to more fully utilize the features of GitHub, but for now, just drag and drop it as the instructions linked describe.

Any feedback you have regarding this exercise is greatly appreciated; you can fill out this form!

References
# Chapter 8 Documenting analyses

## 8.1 Learning Objectives

## 8.2 Why documentation?

Documentation is an important but sometimes overlooked part of creating a reproducible analysis! There are two parts of documentation we will discuss here: 1) notebook descriptions and 2) READMEs.

Both notebook descriptions and READMEs are written in markdown -- a shorthand for HTML (the same as the documentation parts of your code). If you aren't familiar, markdown is such a handy tool and we encourage you to learn it (it doesn't take too long); here's a quick guide to get you started.

### 8.2.1 Notebook descriptions

As we discussed in chapter 5, data analyses can lead one on a winding trail of decisions, but notebooks allow you to narrate your thought process as you travel along these analysis explorations! Your scientific notebook should include descriptions that cover:

#### 8.2.1.1 The purposes of the notebook

What scientific question are you trying to answer? Describe the dataset you are using to try to answer this, and why it helps answer this question.

#### 8.2.1.2 The rationales behind your decisions

Describe why a particular code chunk is doing a particular thing -- the odder the code looks, the greater the need for you to describe why you are doing it. Describe any particular filters or cutoffs you are using and how you decided on those. For data wrangling steps, explain why you are wrangling the data in such a way -- is this because a certain package you are using requires it?

#### 8.2.1.3 Your observations of the results

What do you think about the results? The plots and tables you show in the notebook -- how do they inform your original questions?

### 8.2.2 READMEs!

READMEs are also a great way to help your collaborators get quickly acquainted with the project.
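As a minimal sketch of the kind of README that helps a new collaborator orient themselves, here is one possible shape. The section contents below are purely hypothetical placeholders (borrowing file names from this course's example project); the course provides its own fuller template.

```markdown
# Gene expression heatmap project

## Purpose
Cluster SRP070849 samples by their most variant genes.

## How to re-run
Open and run `make_heatmap.Rmd` (or `make_heatmap.ipynb`) from the project root.

## Software required
R with the tidyverse, or Python with pandas; see the environment files.

## Input and output files
Input: TSV files in `data/SRP070849/`. Output: an annotated heatmap PNG in `plots/`.
```

Even a skeleton like this answers the first questions a reviewer will have: what is this, how do I run it, and what do I need installed?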
READMEs stick out in a project and are a generally universal signal for people new to the project to start by READing them. GitHub automatically previews your file called `README.md` when someone comes to the main page of your repository, which further encourages people looking at your project to read the information in your README.

Information that should be included in a README:

- General purpose of the project
- Instructions on how to re-run the project
- Lists of any software required by the project
- Input and output file descriptions
- Descriptions of any additional tools included in the project

You can take a look at this template README to get you started.

#### 8.2.2.1 More about writing READMEs:

- How to write a good README file
- A Beginners Guide to writing a Kicka** README
- How to write an awesome README

## 8.3 Get the exercise project files (or continue with the files you used in the previous chapter)

**Get the Python project example files**

Click this link to download. Now double click your chapter zip file to unzip. For Windows you may have to follow these instructions.

**Get the R project example files**

Click this link to download. Now double click your chapter zip file to unzip. For Windows you may have to follow these instructions.

## 8.4 Exercise 1: Practice beefing up your notebook descriptions

**Python project exercise**

1) Start up JupyterLab by running `jupyter lab` from your command line.
2) Activate your conda environment using `conda activate reproducible-python`.
3) Open up the notebook you've been working on in the previous chapters: `make_heatmap.ipynb`.
4) Create a new chunk in your notebook and choose the "Markdown" option in the dropdown menu.
5) Continue to add more descriptions where you feel it is necessary. You can reference the descriptions we have in the "final" version in the example Python repository.
(Again, final here is in quotes because we may continue to make improvements to this notebook too -- remember what we said about iterative?)

**R project exercise**

1) Open up RStudio.
2) Open up the notebook you've been working on in the previous chapters: `make_heatmap.Rmd`.
3) In between code chunks, add more descriptions using Markdown language. You can test how this renders by saving your `.Rmd` and then opening up the resulting `nb.html` file and choosing View in Browser.
4) Continue to add more descriptions where you feel it is necessary. You can reference the descriptions we have in the "final" version in the example R repository. (Again, final here is in quotes because we may continue to make improvements to this notebook too -- remember what we said about iterative?)

## 8.5 Exercise 2: Write a README for your project!

1) Download this template README.
2) Fill in the questions inside the { } to create a README for this project. You can reference the "final" versions of the README, but keep in mind it will reference items that we will discuss in the "advanced" portion of this course. See the R README here and the Python README here.
3) Add your README and updated notebook to your GitHub repository. Follow these instructions to add the latest version of your notebook to your GitHub repository. Later, we will practice and discuss how to more fully utilize the features of GitHub, but for now, just drag and drop it as the instructions linked describe.

Any feedback you have regarding this exercise is greatly appreciated; you can fill out this form!

# Chapter 9 Code review

## 9.1 Learning Objectives

We've previously discussed that the only way to know if your analysis is truly reproducible is to send it to someone else to reproduce! That sentiment is at the heart of code review.
Although most of us wouldn't dare send out a manuscript for publishing without our collaborators giving it a line-by-line review, people don't always feel the same way about code. Parker (2017) describes code review:

> Code review will not guarantee an accurate analysis, but it's one of the most reliable ways of establishing one that is more accurate than before.

Not only does code review help boost the accuracy and reproducibility of the analysis, it also helps everyone involved in the process learn something new!

#### 9.1.0.1 Recommended reading about code review

- Code Review Guidelines for Humans by Hauer (2018).
- Your Code Sucks! – Code Review Best Practices by Hildebr (2020).
- Best practices for Code Review by S. Team (2021).
- Why code reviews matter (and actually save time!) by Radigan (2021).

## 9.2 Exercise: Set up your code review request!

Since reproducibility is all about someone else being able to run your code and obtain your results, the exercise in this course involves preparing to do just that!

The goal: In the second part of this reproducibility course we will discuss how to conduct formal line-by-line code reviews, but for now, we will discuss how to prep your analysis for someone else to look at your code and attempt to run it.

At this point, you should have a GitHub repository that contains the following:

- A make_heatmap notebook
- A README
- A data folder containing the metadata and gene expression matrix files in a folder named SRP070849:
  - SRP070849/metadata_SRP070849.tsv
  - SRP070849/SRP070849.tsv

1) Refresh and delete output

Before you send off your code to someone else, delete your output (the results and plots folders) and attempt to re-run it yourself. This also involves restarting your R session/Python kernel and running all the chunks again.

2) Re-run the whole analysis

3) Interrogate and troubleshoot

If your code has any issues running, try your best to troubleshoot the problems. Read this handy guide for tips on troubleshooting R.
4) Rinse and repeat

Repeat this as many times as needed until you are reliably able to re-run this code and get the same results without any code smells popping up. Dig into bad code smells or bad results smells wherever you sense them. If you aren't sure why you feel this way about your code or results, hold on to that thought -- your collaborator may be able to see something you don't.

5) Let it simmer

Leave your analysis for a bit. Do you think it's perfect? Are you at your wits' end with it? No matter how you feel about it, let it sit for half a day or so and then return to it with fresh eyes (Savonen 2021b).

6) Re-review your documentation and code with fresh eyes

Now with fresh eyes, and doing your best to imagine you don't have the knowledge you have -- do your analysis and results make sense?

7) Are you sure it's ready?

Ask yourself whether you've polished this code and documentation as far as you can reasonably take it. Realize that determining what qualifies as "as far as you can reasonably take it" is also a skill you will build with time. Code review is the most efficient use of everyone's time when your code and documentation have reached this point.

8) Draft your request

Now you are ready to send this code to your collaborator, but first try to send them a specific set of instructions and questions about what you would like them to review. In your message to them, include this information (you may want to draft this out in a scratch file).

Code review requests should include:

- A link to your repository that has your README, to get them quickly oriented to the project.
- A request for what kind of feedback you are looking for. Big picture? Technical? Method selection?
- Are there specific areas of the code you are having trouble with or are unsure about? Send a link to the specific lines in GitHub you are asking about.
- Are there results that are surprising, confusing, or smell wrong?
- Be sure to detail what you have dug into and tried at this point for any problematic points.
- Explicitly state what commands or tests you'd like them to run.
- Lastly, thank them for helping review your code!

9) Ready for review

Now you are ready to send your crafted message to your collaborator for review. But for the purposes of this exercise, you may not want to ask your collaborator to spend their time carefully reviewing this practice repository. Now that you understand and have done the steps involved, you are prepared to do this for your own analyses!

TL;DR for asking for a code review:

Any feedback you have regarding this exercise is greatly appreciated; you can fill out this form!

In the second part of this course, we will discuss how to conduct code review through GitHub, further utilize version control, and more!

# About the Authors

These credits are based on our course contributors table guidelines.

Credits:

- Pedagogy Lead / Content Instructor / Lecturer: Candace Savonen
- Content Directors: Jeff Leek, Sarah Wheelan
- Content Consultants: [David Swiderski]
- Acknowledgments: [Patrick O'Connell]
- Content Publisher: Ira Gooding
- Content Publishing Reviewers: Ira Gooding
- Technical Course Publishing Engineer: Candace Savonen
- Template Publishing Engineers: Candace Savonen, Carrie Wright
- Publishing Maintenance Engineer: Candace Savonen
- Technical Publishing Stylists: Carrie Wright, Candace Savonen
- Package Developers (ottrpal): John Muschelli, Candace Savonen, Carrie Wright
- Illustrator / Figure Artist / Videographer / Videography Editor: Candace Savonen
- Funder: National Cancer Institute (NCI) UE5 CA254170
- Funding Staff: Emily Voeglein, Fallon Bachman

## ─ Session info ───────────────────────────────────────────────────────────────
##  setting  value
##  version  R version 4.0.2 (2020-06-22)
##  os       Ubuntu 20.04.3 LTS
##  system
x86_64, linux-gnu
##  ui       X11
##  language (EN)
##  collate  en_US.UTF-8
##  ctype    en_US.UTF-8
##  tz       Etc/UTC
##  date     2024-03-25
David Swiderski
Patrick O'Connell

References

]]
diff --git a/docs/no_toc/using-notebooks.html b/docs/no_toc/using-notebooks.html
index 3e607888..f6c3814d 100644
--- a/docs/no_toc/using-notebooks.html
+++ b/docs/no_toc/using-notebooks.html
@@ -347,14 +347,14 @@ Get the exercise project file
Get the Python project example files
Click this link to download.
-Now double click your chapter zip file to unzip. For Windows you may have to follow these instructions).
+Now double click your chapter zip file to unzip. For Windows you may have to follow these instructions.
Get the R project example files
Click this link to download.
-Now double click your chapter zip file to unzip. For Windows you may have to follow these instructions).
+Now double click your chapter zip file to unzip. For Windows you may have to follow these instructions.
@@ -512,13 +512,18 @@
References
Xie, Yihui, J. J. Allaire, and Garrett Grolemund. 2018.
R Markdown: The Definitive Guide. Boca Raton, Florida: Chapman; Hall/CRC.
https://bookdown.org/yihui/rmarkdown.
+
+
+
+
+
+
+
+
diff --git a/docs/no_toc/writing-durable-code.html b/docs/no_toc/writing-durable-code.html
index dc879b54..f0dae65b 100644
--- a/docs/no_toc/writing-durable-code.html
+++ b/docs/no_toc/writing-durable-code.html
@@ -421,7 +421,7 @@ Don’t be afraid to dele
Keeping around old code makes it harder for you to write and troubleshoot new better code – it’s easier to confuse yourself. Sometimes a fresh start can be what you need.
-With version control you can always return to that old code! (We’ll dive more into version control later on, but you’ve started the process by uploading your code to GitHub in chapter 4!)
+With version control you can always return to that old code! (We’ll dive more into version control later on, but you’ve started the process by uploading your code to GitHub in chapter 4!)
This means you should not comment out old code. Just delete it! No code is so precious that you need to keep it commented out (particularly if you are using version control and you can retrieve it in other ways should you need it).
Related to this, if you want to be certain that your code is reproducible, it’s worth deleting all your output, and re-running everything with a fresh session. The first step to knowing if your analysis is reproducible is seeing if you can repeat it yourself!
@@ -568,7 +568,7 @@ Python specific:
+
+
+
+
+
+
+
+
diff --git a/docs/organizing-your-project.html b/docs/organizing-your-project.html
index ba6b188d..071a271e 100644
--- a/docs/organizing-your-project.html
+++ b/docs/organizing-your-project.html
@@ -410,14 +410,14 @@ Get the exercise project file
Get the Python project example files
Click this link to download.
-Now double click your chapter zip file to unzip. For Windows you may have to follow these instructions).
+Now double click your chapter zip file to unzip. For Windows you may have to follow these instructions.
Get the R project example files
Click this link to download.
-Now double click your chapter zip file to unzip. For Windows you may have to follow these instructions).
+Now double click your chapter zip file to unzip. For Windows you may have to follow these instructions.
+
+
+
+
+
+
+
+
diff --git a/docs/references.html b/docs/references.html
index f5e92b73..b9150c58 100644
--- a/docs/references.html
+++ b/docs/references.html
@@ -498,13 +498,18 @@ References
+
+
+
+
+
+
+
+
diff --git a/docs/resources/images/01-intro_files/figure-html/1LMurysUhCjZb7DVF6KS9QmJ5NBjwWVjRn40MS9f2noE_g106226cdd08_0_0.png b/docs/resources/images/01-intro_files/figure-html/1LMurysUhCjZb7DVF6KS9QmJ5NBjwWVjRn40MS9f2noE_g106226cdd08_0_0.png
index eb5c3a98..781c1ada 100644
Binary files a/docs/resources/images/01-intro_files/figure-html/1LMurysUhCjZb7DVF6KS9QmJ5NBjwWVjRn40MS9f2noE_g106226cdd08_0_0.png and b/docs/resources/images/01-intro_files/figure-html/1LMurysUhCjZb7DVF6KS9QmJ5NBjwWVjRn40MS9f2noE_g106226cdd08_0_0.png differ
diff --git a/docs/resources/images/06-package-management_files/figure-html/1LMurysUhCjZb7DVF6KS9QmJ5NBjwWVjRn40MS9f2noE_g102dc56db08_0_0.png b/docs/resources/images/06-package-management_files/figure-html/1LMurysUhCjZb7DVF6KS9QmJ5NBjwWVjRn40MS9f2noE_g102dc56db08_0_0.png
index 3493e335..b96bc1fc 100644
Binary files a/docs/resources/images/06-package-management_files/figure-html/1LMurysUhCjZb7DVF6KS9QmJ5NBjwWVjRn40MS9f2noE_g102dc56db08_0_0.png and b/docs/resources/images/06-package-management_files/figure-html/1LMurysUhCjZb7DVF6KS9QmJ5NBjwWVjRn40MS9f2noE_g102dc56db08_0_0.png differ
diff --git a/docs/resources/images/06-package-management_files/figure-html/1LMurysUhCjZb7DVF6KS9QmJ5NBjwWVjRn40MS9f2noE_gf62875ddf7_0_404.png b/docs/resources/images/06-package-management_files/figure-html/1LMurysUhCjZb7DVF6KS9QmJ5NBjwWVjRn40MS9f2noE_gf62875ddf7_0_404.png
index e631c7d5..8e97ff1c 100644
Binary files a/docs/resources/images/06-package-management_files/figure-html/1LMurysUhCjZb7DVF6KS9QmJ5NBjwWVjRn40MS9f2noE_gf62875ddf7_0_404.png and b/docs/resources/images/06-package-management_files/figure-html/1LMurysUhCjZb7DVF6KS9QmJ5NBjwWVjRn40MS9f2noE_gf62875ddf7_0_404.png differ
diff --git a/docs/search_index.json b/docs/search_index.json
index 8f9ec39b..4c5e5fbe 100644
--- a/docs/search_index.json
+++ b/docs/search_index.json
@@ -1 +1 @@
-[["index.html", "Intro to Reproducibility in Cancer Informatics About this Course 0.1 Available course formats", " Intro to Reproducibility in Cancer Informatics October, 2022 About this Course This course is part of a series of courses for the Informatics Technology for Cancer Research (ITCR) called the Informatics Technology for Cancer Research Education Resource. This material was created by the ITCR Training Network (ITN) which is a collaborative effort of researchers around the United States to support cancer informatics and data science training through resources, technology, and events. This initiative is funded by the following grant: National Cancer Institute (NCI) UE5 CA254170. Our courses feature tools developed by ITCR Investigators and make it easier for principal investigators, scientists, and analysts to integrate cancer informatics into their workflows. Please see our website at www.itcrtraining.org for more information. 0.1 Available course formats This course is available in multiple formats which allows you to take it in the way that best suites your needs. You can take it for certificate which can be for free or fee. The material for this course can be viewed without login requirement on this Bookdown website. This format might be most appropriate for you if you rely on screen-reader technology. This course can be taken for free certification through Leanpub. This course can be taken on Coursera for certification here (but it is not available for free on Coursera). Our courses are open source, you can find the source material for this course on GitHub. "],["introduction.html", "Chapter 1 Introduction 1.1 Target Audience 1.2 Topics covered: 1.3 Motivation 1.4 Curriculum 1.5 How to use the course", " Chapter 1 Introduction 1.1 Target Audience The course is intended for students in the biomedical sciences and researchers who use informatics tools in their research and have not had training in reproducibility tools and methods. 
This course is written for individuals who: Have some familiarity with R or Python - have written some scripts. Have not had formal training in computational methods. Have limited or no familiar with GitHub, Docker, or package management tools. 1.2 Topics covered: This is a two part series: 1.3 Motivation Cancer datasets are plentiful, complicated, and hold untold amounts of information regarding cancer biology. Cancer researchers are working to apply their expertise to the analysis of these vast amounts of data but training opportunities to properly equip them in these efforts can be sparse. This includes training in reproducible data analysis methods. Data analyses are generally not reproducible without direct contact with the original researchers and a substantial amount of time and effort (Beaulieu-Jones and Greene 2017). Reproducibility in cancer informatics (as with other fields) is still not monitored or incentivized despite that it is fundamental to the scientific method. Despite the lack of incentive, many researchers strive for reproducibility in their own work but often lack the skills or training to do so effectively. Equipping researchers with the skills to create reproducible data analyses increases the efficiency of everyone involved. Reproducible analyses are more likely to be understood, applied, and replicated by others. This helps expedite the scientific process by helping researchers avoid false positive dead ends. Open source clarity in reproducible methods also saves researchers’ time so they don’t have to reinvent the proverbial wheel for methods that everyone in the field is already performing. 1.4 Curriculum This course introduces the concepts of reproducibility and replicability in the context of cancer informatics. It uses hands-on exercises to demonstrate in practical terms how to increase the reproducibility of data analyses. 
The course also introduces tools relevant to reproducibility including analysis notebooks, package managers, git and GitHub. The course includes hands-on exercises for how to apply reproducible code concepts to their code. Individuals who take this course are encouraged to complete these activities as they follow along with the course material to help increase the reproducibility of their analyses. Goal of this course: Equip learners with reproducibility skills they can apply to their existing analyses scripts and projects. This course opts for an “ease into it” approach. We attempt to give learners doable, incremental steps to increase the reproducibility of their analyses. What is not the goal This course is meant to introduce learners to the reproducibility tools, but it does not necessarily represent the absolute end-all, be-all best practices for the use of these tools. In other words, this course gives a starting point with these tools, but not an ending point. The advanced version of this course is the next step toward incrementally “better practices”. 1.5 How to use the course This course is designed with busy professional learners in mind – who may have to pick up and put down the course when their schedule allows. Each exercise has the option for you to continue along with the example files as you’ve been editing them in each chapter, OR you can download fresh chapter files that have been edited in accordance with the relative part of the course. This way, if you decide to skip a chapter or find that your own files you’ve been working on no longer make sense, you have a fresh starting point at each exercise. References "],["defining-reproducibility.html", "Chapter 2 Defining reproducibility 2.1 Learning Objectives 2.2 What is reproducibility 2.3 Reproducibility in daily life 2.4 Reproducibility is worth the effort! 
2.5 Reproducibility exists on a continuum!", " Chapter 2 Defining reproducibility 2.1 Learning Objectives 2.2 What is reproducibility There’s been a lot of discussion about what is included in the term reproducibility and there is some discrepancy between fields. For the purposes of informatics and data analysis, a reproducible analysis is one that can be re-run by a different researcher and the same result and conclusion is found. Reproducibility is related to repeatability and replicability but it is worth taking time to differentiate these terms Perhaps you are like Ruby and have just found an interesting pattern through your data analysis! This has probably been the result of many months or years on your project and it’s worth celebrating! But before she considers these results a done deal, Ruby should test whether she is able to re-run her own analysis and get the same results again. This is known as repeatability. Given that Ruby’s analysis is repeatable; she may feel confident now to share her preliminary results with her colleague, Avi the Associate. Whether or not someone else will be able to take Ruby’s code and data, re-run the analysis and obtain the same results is known as reproducibility. If Ruby’s results are able to be reproduced by Avi, now Avi may collect new data and use Ruby’s same analysis methods to analyze his data. Whether or not Avi’s new data and results concur with Ruby’s study’s original inferences is known as replicability. You may realize that these levels of research build on each other (like science is supposed to do). In this way, we can think of these in a hierarchy. Skipping any of these levels of research applicability can lead to unreliable results and conclusions. Science progresses when data and hypotheses are put through these levels thoroughly and sequentially. If results are not repeatable, they won’t be reproducible or replicable. 
Ideally all analyses and results would be reproducible without too much time and effort spent; this would aid in the efficiency of research getting to the next stages and questions. But unfortunately, in practice, reproducibility is not as commonplace as we would hope. Institutions and reward systems generally do not prioritize or even measure reproducibility standards in research and training opportunities for reproducible techniques can be scarce. Reproducible research can often feel like an uphill battle that is made steeper by lack of training opportunities. In this course, we hope to equip your research with the tools you need to enhance the reproducibility of your analyses so this uphill battle is less steep. 2.3 Reproducibility in daily life What does reproducibility mean in the daily life of a researcher? Let’s say Ruby’s results are repeatable in her own hands and she excitedly tells her associate, Avi, about her preliminary findings. Avi is very excited about these results as well as Ruby’s methods! Avi is also interested in Ruby’s analysis methods and results. So Ruby sends Avi the code and data she used to obtain the results. Now, whether or not Avi is able to obtain the same exact results with this same data and same analysis code will indicate if Ruby’s analysis is reproducible. Ruby may have spent a lot of time on her code and getting it to work on her computer, but whether it will successfully run on Avi’s computer is another story. Often when researchers share their analysis code it leads to a substantial amount of effort on the part of the researcher who has received the code to get it working and this often cannot be done successfully without help from the original code author (Beaulieu-Jones and Greene 2017). Avi is encountering errors because Ruby’s code was written with Ruby’s computer and local setup in mind and she didn’t know how to make it more generally applicable. 
Avi is spending a lot of time just trying to re-run Ruby’s same analysis on her same data; he has yet to be able to try the code on any additional data (which will likely bring up even more errors). Avi is still struggling to work with Ruby’s code and is confused about the goals and approaches the code is taking. After struggling with Avi’s code for an untold amount of time, Avi may decide it’s time to email Ruby to get some clarity. Now both Avi and Ruby are confused about why this analysis isn’t nicely re-running for Avi. Their attempts to communicate about the code through email haven’t helped them clarify anything. Multiple versions of the code may have been sent back and forth between them and now things are taking a lot more time than either of them expected. Perhaps at some point Avi is able to successfully run Ruby’s code on Ruby’s same data. Just because Avi didn’t get any errors doesn’t mean that the code ran exactly the same as it did for Ruby. Lack of errors also doesn’t mean that either Ruby or Avi’s runs of the code ran with high accuracy or that the results can be trusted. Even a small difference in decimal point may indicate a more fundamental difference in how the analysis was performed and this could be due to differences in software versions, settings, or any number of items in their computing environments. 2.4 Reproducibility is worth the effort! Perhaps you’ve found yourself in a situation like Ruby and Avi; struggling to re-run code that you thought for sure was working a minute ago. In the upcoming chapters, we will discuss how to bolster your projects’ reproducibility. As you apply these reproducible techniques to your own projects, you may feel like it is taking more time to reach endpoints, but keep in mind that reproducible analyses and projects have higher upfront costs but these will absolutely pay off in the long term. 
Reproducibility in your analyses is not only a time saver for yourself, but also your colleagues, your field, and your future self! You might not change a single character in your code but then return to it in a a few days/months/years and find that it no longer runs! Reproducible code stands the test of time longer, making ‘future you’ glad you spent the time to work on it. It’s said that your closest collaborator is you from 6 months ago but you don’t reply to email (Broman 2016). Many a data scientist has referred to their frustration with their past selves: Dear past-Hadley: PLEASE COMMENT YOUR CODE BETTER. Love present-Hadley — Hadley Wickham (@hadleywickham) April 7, 2016 The more you comment your code, and make it clear and readable, your future self will thank you. Reproducible code also saves your colleagues time! The more reproducible your code is, the less time all of your collaborators will need to spend troubleshooting it. The more people who use your code and need to try to fix it, the more time is wasted. This can add up to a lot of wasted researcher time and effort. But, reproducible code saves everyone exponential amounts of time and effort! It will also motivate individuals to use and cite your code and analyses in the future! 2.5 Reproducibility exists on a continuum! Incremental work on your analyses is good! You do not need to make your analyses perfect on the first try or even within a particular time frame. The first step in creating an analysis is to get it to work once! But the work does not end there. Furthermore, no analysis is or will ever be perfect in that it will not be reproducible in every single context throughout time. incrementally pushing our analyses toward the right of this continuum is the goal. 
References "],["organizing-your-project.html", "Chapter 3 Organizing your project 3.1 Learning Objectives 3.2 Organizational strategies 3.3 Readings about organizational strategies for data science projects: 3.4 Get the exercise project files (or continue with the files you used in the previous chapter) 3.5 Exercise: Organize your project!", " Chapter 3 Organizing your project 3.1 Learning Objectives Keeping your files organized is a skill that has a high long-term payoff. As you are in the thick of an analysis, you may underestimate how many files and terms you have floating around. But a short time later, you may return to your files and realize your organization was not as clear as you hoped. Tayo (2019) discusses four particular reasons why it is important to organize your project: Organization increases productivity. If a project is well organized, with everything placed in one directory, it makes it easier to avoid wasting time searching for project files such as datasets, codes, output files, and so on. A well-organized project helps you to keep and maintain a record of your ongoing and completed data science projects. Completed data science projects could be used for building future models. If you have to solve a similar problem in the future, you can use the same code with slight modifications. A well-organized project can easily be understood by other data science professionals when shared on platforms such as Github. Organization is yet another aspect of reproducibility that saves you and your colleagues time! 3.2 Organizational strategies There’s a lot of ways to keep your files organized, and there’s not a “one size fits all” organizational solution (Shapiro et al. 2021). In this chapter, we will discuss some generalities but as far as specifics we will point you to others who have written about works for them and advise that you use them as inspiration to figure out a strategy that works for you and your team. 
The most important aspects of your project organization scheme is that it: Is project-oriented (Bryan 2017). Follows consistent patterns (Shapiro et al. 2021). Is easy for you and others to find the files you need quickly (Shapiro et al. 2021). Minimizes the likelihood for errors (like writing over files accidentally) (Shapiro et al. 2021). Is something maintainable (Shapiro et al. 2021)! 3.2.1 Tips for organizing your project: Getting more specific, here’s some ideas of how to organize your project: Make file names informative to those who don’t have knowledge of the project but avoid using spaces, quotes, or unusual characters in your filenames and folders – these only serve to make reading in files a nightmare in some programs. Number scripts in the order that they are run. Keep like-files together in their own directory: results tables with other results tables, etc. Including most importantly keeping raw data separate from processed data or other results! Put source scripts and functions in their own directory. Things that should never need to be called directly by yourself or anyone else. Put output in its own directories like results and plots. Have a central document (like a README) that describes the basic information about the analysis and how to re-run it. Make it easy on yourself, dates aren’t necessary. The computer keeps track of those. Make a central script that re-runs everything – including the creation of the folders! (more on this in a later chapter) Let’s see what these principles might look like put into practice. 
3.2.1.1 Example organizational scheme Here’s an example of what this might look like: project-name/ ├── run_analysis.sh ├── 00-download-data.sh ├── 01-make-heatmap.Rmd ├── README.md ├── plots/ │ └── project-name-heatmap.png ├── results/ │ └── top_gene_results.tsv ├── raw-data/ │ ├── project-name-raw.tsv │ └── project-name-metadata.tsv ├── processed-data/ │ ├── project-name-quantile-normalized.tsv └── util/ ├── plotting-functions.R └── data-wrangling-functions.R What these hypothetical files and folders contain: run_analysis.sh - A central script that runs everything again 00-download-data.sh - The script that needs to be run first and is called by run_analysis.sh 01-make-heatmap.Rmd - The script that needs to be run second and is also called by run_analysis.sh README.md - The document that has the information that will orient someone to this project, we’ll discuss more about how to create a helpful README in an upcoming chapter. plots - A folder of plots and resulting images results - A folder results raw-data - Data files as they first arrive and nothing has been done to them yet. processed-data - Data that has been modified from the raw in some way. util - A folder of utilities that never needs to be called or touched directly unless troubleshooting something 3.3 Readings about organizational strategies for data science projects: But you don’t have to take my organizational strategy, there are lots of ideas out there. You can read through some of these articles to think about what kind of organizational strategy might work for you and your team: Jenny Bryan’s organizational strategies (Bryan and Hester 2021). Danielle Navarro’s organizational strategies Navarro (2021) Jenny Bryan on Project-oriented workflows(Bryan 2017). Data Carpentry mini-course about organizing projects (“Project Organization and Management for Genomics” 2021). Andrew Severin’s strategy for organization (Severin 2021). 
A BioStars thread where many individuals share their own organizational strategies (“How Do You Manage Your Files & Directories for Your Projects?” 2010). Data Carpentry course chapter about getting organized (“Introduction to the Command Line for Genomics” 2019). 3.4 Get the exercise project files (or continue with the files you used in the previous chapter) Get the Python project example files Click this link to download. Now double click your chapter zip file to unzip. For Windows you may have to follow these instructions). Get the R project example files Click this link to download. Now double click your chapter zip file to unzip. For Windows you may have to follow these instructions). 3.5 Exercise: Organize your project! Using your computer’s GUI (drag, drop, and clicking), organize the files that are part of this project. Organized these files using an organizational scheme similar to what is described above. Create folders like plots, results, and data folder. Note that aggregated_metadata.json and LICENSE.TXT also belong in the data folder. You will want to delete any files that say “OLD”. Keeping multiple versions of your scripts around is a recipe for mistakes and confusion. In the advanced course we will discuss how to use version control to help you track this more elegantly. After your files are organized, you are ready to move on to the next chapter and create a notebook! Any feedback you have regarding this exercise is greatly appreciated; you can fill out this form! References "],["making-your-project-open-source-with-github.html", "Chapter 4 Making your project open source with GitHub 4.1 Learning Objectives 4.2 Get the exercise project files (or continue with the files you used in the previous chapter) 4.3 Exercise: Set up a project on GitHub", " Chapter 4 Making your project open source with GitHub 4.1 Learning Objectives git is a version control system that is a great tool for creating reproducible analyses. What is version control? 
Ruby here is experiencing a lack of version control and could probably benefit from using git. All of us at one point or another have created different versions of a file or document, but for analysis projects this can easily get out of hand if you don’t have a system in place. That’s where git comes in handy. There are other version control systems as well, but git is the most popular in part because it works with GitHub, an online hosting service for git controlled files. 4.1.1 GitHub and git allow you to… 4.1.1.1 Maintain transparent analyses Open and transparent analyses are a critical part to conducting open science. GitHub allows you to conduct your analyses in an open source manner. Open science also allows others to better understand your methods and potentially borrow them for their own research, saving everyone time! 4.1.1.2 Have backups of your code and analyses at every point Life happens, sometimes you misplace a file or your computer malfunctions. If you ever lose data on your computer or need to retrieve something from an earlier version of your code, GitHub allows you to revert your losses. 4.1.1.3 Keep a documented history of your project Overtime in a project, a lot happens, especially when it comes to exploring and handling data. Sometimes the rationale behind decisions that were made around an analysis can get lost. GitHub keeps communications and tracks the changes to your files so that you don’t have to revisit a question you already answered. 4.1.1.4 Collaborate with others Analysis projects highly benefit from good collaborations! But having multiple copies of code on multiple collaborators’ computers can be a nightmare to keep straight. GitHub allows people to work on the same set of code concurrently but still have a method to integrate all the edits together in a systematic way. 
4.1.1.5 Experiment with your analysis Data science projects often lead to side analyses that could be very worth while but might be scary to venture on if you don’t have your code well version controlled. Git and GitHub allow you to venture on these side experiments without fear since your main code can be kept safe from your side venture. 4.2 Get the exercise project files (or continue with the files you used in the previous chapter) Get the Python project example files Click this link to download. Now double click your chapter zip file to unzip. For Windows you may have to follow these instructions). Get the R project example files Click this link to download. Now double click your chapter zip file to unzip. For Windows you may have to follow these instructions). 4.3 Exercise: Set up a project on GitHub Go here for the video tutorial version of this exercise. Now that we understand how useful GitHub is for creating reproducible analyses, it’s time to set ourselves up on GitHub. Git and GitHub have a whole rich world of tools and terms that can get complex quickly, but for this exercise, we will not worry about those terms and functionalities just yet, but focus on getting code up on GitHub so we are ready to collaborate and conduct open analyses! Go to Github’s main page and click Sign Up if you don’t have an account. Follow these instructions to create a repository. As a general, but not absolute rule, you will want to keep one GitHub repository for one analysis project. Name the repository something that reminds you what its related to. Choose Public. Check the box that says Add a README. Follow these instructions to add the example files you downloaded to your new repository. Congrats! You’ve started your very own project on GitHub! We encourage you to do the same with your own code and other projects! Any feedback you have regarding this exercise is greatly appreciated; you can fill out this form! 
"],["using-notebooks.html", "Chapter 5 Using Notebooks 5.1 Learning Objectives 5.2 Get the exercise project files (or continue with the files you used in the previous chapter) 5.3 Exercise: Convert code into a notebook!", " Chapter 5 Using Notebooks 5.1 Learning Objectives Notebooks are a handy way to have the code, output, and scientist’s thought process all documented in one place that is easy for others to read and follow. The notebook environment is incredibly useful for reproducible data science for a variety of reasons: 5.1.0.1 Reason 1: Notebooks allow for tracking data exploration and encourage the scientist to narrate their thought process: Each executed code cell is an attempt by the researcher to achieve something and to tease out some insight from the data set. The result is displayed immediately below the code commands, and the researcher can pause and think about the outcome. As code cells can be executed in any order, modified and re-executed as desired, deleted and copied, the notebook is a convenient environment to iteratively explore a complex problem. (Fangohr 2021) 5.1.0.2 Reason 2: Notebooks allow for easy sharing of results: Notebooks can be converted to html and pdf, and then shared as static read-only documents. This is useful to communicate and share a study with colleagues or managers. By adding sufficient explanation, the main story can be understood by the reader, even if they wouldn’t be able to write the code that is embedded in the document. (Fangohr 2021) 5.1.0.3 Reason 3: Notebooks can be re-ran as a script or developed interactively: A common pattern in science is that a computational recipe is iteratively developed in a notebook. Once this has been found and should be applied to further data sets (or other points in some parameter space), the notebook can be executed like a script, for example by submitting these scripts as batch jobs. 
(Fangohr 2021) This can also be handy especially if you use automation to enhance the reproducibility of your analyses (something we will talk about in the advanced part of this course). Because of all of these reasons, we encourage the use of computational notebooks as a means of enhancing reproducibility. (This course itself is also written with the use of notebooks!) 5.2 Get the exercise project files (or continue with the files you used in the previous chapter) Get the Python project example files Click this link to download. Now double click your chapter zip file to unzip. For Windows you may have to follow these instructions). Get the R project example files Click this link to download. Now double click your chapter zip file to unzip. For Windows you may have to follow these instructions). 5.3 Exercise: Convert code into a notebook! 5.3.1 Set up your IDE For this chapter, we will create notebooks from our example files code. Notebooks work best with the integrated development environment (IDE) they were created to work with. IDE’s are sets of tools that help you develop your code. They are part “point and click” and part command line and include lots of visuals that will help guide you. Set up a Python IDE Install JupyterLab We advise using the conda method to install JupyterLab, because we will return to talk more about conda later on, so if you don’t have conda, you will need to install that first. We advise going with Anaconda instead of miniconda. To install Anaconda you can download from here. Download the installer, and follow the installation prompts. Start up Anaconda navigator. On the home page choose JupyterLab and click Install. This may take a few minutes. Now you should be able to click Launch underneath JupyterLab. This will open up a page in your Browser with JupyterLab. 
Getting familiar with JupyterLab’s interface The JupyterLab interface consists of a main work area containing tabs of documents and activities, a collapsible left sidebar, and a menu bar. The left sidebar contains a file browser, the list of running kernels and terminals, the command palette, the notebook cell tools inspector, and the tabs list. The menu bar at the top of JupyterLab has top-level menus that expose actions available in JupyterLab with their keyboard shortcuts. The default menus are: File: actions related to files and directories Edit: actions related to editing documents and other activities View: actions that alter the appearance of JupyterLab Run: actions for running code in different activities such as notebooks and code consoles Kernel: actions for managing kernels, which are separate processes for running code Tabs: a list of the open documents and activities in the dock panel Settings: common settings and an advanced settings editor Help: a list of JupyterLab and kernel help links Set up an R IDE Install RStudio Install RStudio (and install R first if you have not already). After you’ve downloaded the RStudio installation file, double click on it and follow along with the installation prompts. Open up the RStudio application by double clicking on it. Getting familiar with RStudio’s interface The RStudio environment has four main panes, each of which may have a number of tabs that display different information or functionality. (their specific location can be changed under Tools -> Global Options -> Pane Layout). The Editor pane is where you can write R scripts and other documents. Each tab here is its own document. This is your text editor, which will allow you to save your R code for future use. Note that change code here will not run automatically until you run it. The Console pane is where you can interactively run R code. 
There is also a Terminal tab here which can be used for running programs outside R on your computer The Environment pane primarily displays the variables, sometimes known as objects that are defined during a given R session, and what data or values they might hold. The Help viewer pane has several tabs all of which are pretty important: The Files tab shows the structure and contents of files and folders (also known as directories) on your computer. The Plots tab will reveal plots when you make them The Packages tab shows which installed packages have been loaded into your R session The Help tab will show the help page when you look up a function The Viewer pane will reveal compiled R Markdown documents From Shapiro et al. (2021) More reading about RStudio’s interface: RStudio IDE Cheatsheet (pdf). Navigating the RStudio Interface - R for Epidemiology 5.3.2 Create a notebook! Now, in your respective IDE, we’ll turn our unreproducible scripts into notebooks. In the next chapter we will begin to dive into the code itself, but for now, we’ll get the notebook ready to go. Set up a Python notebook Start a new notebook by going to New > Notebook. Then open up this chapter’s example code folder and open the make-heatmap.py file. Notebook. Then open up this chapter’s example code folder and open the make-heatmap.py file.” style=“display: block; margin: auto;” /> Create a new code chunk in your notebook. Now copy and paste all of the code from make-heatmap.py into a new chunk. We will later break up this large chunk of code into smaller chunks that are thematic in the next chapter. Save your Untitled.ipynb file as something that tells us what it will end up doing like make-heatmap.ipynb. For more about using Jupyter notebooks see this by Mike (2021). Set up an R notebook Start a new notebook by going to File > New Files > R Notebook. Then open up this chapter’s example code folder and open the make_heatmap.R file. New Files > R Notebook. 
3. Practice creating a new chunk in your R notebook by clicking the Code > Insert Chunk button on the toolbar or by pressing Cmd+Option+I (on Mac) or Ctrl+Alt+I (on Windows). (You can also manually type out the back ticks and {}.)
4. Delete all the default text in this notebook but keep the header, which is surrounded by `---` and looks like:

    title: "R Notebook"
    output: html_notebook

   Feel free to change the title from R Notebook to something that better suits the contents of this notebook.
5. Now copy and paste all of the code from make_heatmap.R into a new chunk. We will break up this large chunk of code into smaller, thematic chunks in the next chapter.
6. Save your untitled.Rmd as something that tells us what it will end up doing, like make-heatmap.Rmd.
7. Notice that upon saving your .Rmd file, a new .nb.html file of the same name is created. Open that file and choose view in Browser. If RStudio asks you to choose a browser, then choose a default browser.
8. This shows the nicely rendered version of your analysis and snapshots whatever output existed when the .Rmd file was saved.

For more about using R notebooks see this by Xie, Allaire, and Grolemund (2018).

Now that you've created your notebook, you are ready to start polishing that code! Any feedback you have regarding this exercise is greatly appreciated; you can fill out this form!

# Chapter 6 Managing package versions

## 6.1 Learning Objectives

As we discussed previously, sometimes two different researchers can run the same code and same data and get different results!
What Ruby and Avi may not realize is that although they may have used the same code and data, the software packages on each of their computers might be very different. Even if they have the same software packages, they likely don't have the same versions, and versions can influence results! Different computing environments are not only a headache to detangle, they can also influence the reproducibility of your results (Beaulieu-Jones and Greene 2017).

There are multiple ways to deal with variations in computing environments so that your analyses will be reproducible, and we will discuss a few different strategies for tackling this problem in this course and its follow-up course. But for now, we will start with the least intensive to implement: session info.

There are two strategies for dealing with software versions that we will discuss in this chapter. Either of these strategies can be used alone or you can use both. They address different aspects of the computing environment discrepancy problem.

### 6.1.1 Strategy 1: Session info - record a list of your packages

One strategy to combat different software versions is to list the session info. This is the easiest (though not most comprehensive) method for handling differences in software versions: have your code list details about your computing environment. Session info can lead to clues as to why results weren't reproducible. For example, if both Avi and Ruby ran notebooks and included a session info print out, it may look like this:

Session info shows us that they have different R versions and different operating systems. The package they both have attached is rmarkdown, but they have different rmarkdown package versions. If Avi and Ruby have discrepancies in their results, the session info print out gives a record which may have clues for any discrepancies. This can give them items to look into for determining why the results didn't reproduce as expected.
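The exercises below use `sessionInfo()` (R) and the `session_info` package (Python) for this. As a minimal, stdlib-only illustration of the idea, a notebook could record some basic environment details itself. (The `basic_session_info` helper below is our own hypothetical sketch, not part of any package.)

```python
import platform
import sys

def basic_session_info():
    """Return a dict of basic computing-environment details worth recording.

    A hypothetical, stdlib-only sketch; dedicated tools like the
    session_info package record much more (e.g. every loaded package
    and its version).
    """
    return {
        "python_version": sys.version.split()[0],  # e.g. "3.11.4"
        "platform": platform.platform(),           # OS name and version
        "machine": platform.machine(),             # e.g. "x86_64"
    }

# Print the record at the end of an analysis so it lands in the output:
for key, value in basic_session_info().items():
    print(f"{key}: {value}")
```

Comparing two such print outs is exactly how Ruby and Avi could spot that their Python versions or operating systems differ.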
### 6.1.2 Strategy 2: Package managers - share a usable snapshot of your environment

Package managers can help handle your computing environment for you in a way that you can share with others. In general, package managers work by capturing a snapshot of the environment, and when that environment snapshot is shared, they attempt to rebuild it. For the R and Python versions of the exercises, we will be using different managers, but the foundational strategy will be the same: include a file that someone else could replicate your package set up from. For both exercises, we will download an environment 'snapshot' file we've set up for you, then we will practice adding a new package to the environments we've provided, and add them to your new repository along with the rest of your example project files.

- For Python, we'll use conda for package management and store this information in an environment.yml file.
- For R, we'll use renv for package management and store this information in a renv.lock file.

## 6.2 Get the exercise project files (or continue with the files you used in the previous chapter)

Get the Python project example files

1. Click this link to download.
2. Now double click your chapter zip file to unzip. (For Windows you may have to follow these instructions.)

Get the R project example files

1. Click this link to download.
2. Now double click your chapter zip file to unzip. (For Windows you may have to follow these instructions.)

## 6.3 Exercise 1: Print out session info

Python version of the exercise

In your scientific notebook, you'll need to add two items.

1. Add `import session_info` to a code chunk at the beginning of your notebook.
2. Add `session_info.show()` to a new code chunk at the very end of your notebook.
3. Save your notebook as is. Note it will not run correctly until we address the issues with the code in the next chapter.
R version of the exercise

In your Rmd file, add a chunk at the very end that looks like this:

```r
sessionInfo()
```

```
## R version 4.0.2 (2020-06-22)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.3 LTS
##
## Matrix products: default
## BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so
##
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=C
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base
##
## loaded via a namespace (and not attached):
##  [1] knitr_1.33      magrittr_2.0.2  hms_0.5.3       R6_2.4.1
##  [5] rlang_0.4.10    highr_0.8       stringr_1.4.0   httr_1.4.2
##  [9] tools_4.0.2     xfun_0.26       jquerylib_0.1.4 htmltools_0.5.0
## [13] ellipsis_0.3.1  ottrpal_0.1.2   yaml_2.2.1      digest_0.6.25
## [17] tibble_3.0.3    lifecycle_1.0.0 crayon_1.3.4    bookdown_0.24
## [21] readr_1.4.0     vctrs_0.3.4     fs_1.5.0        curl_4.3
## [25] evaluate_0.14   rmarkdown_2.10  stringi_1.5.3   compiler_4.0.2
## [29] pillar_1.4.6    pkgconfig_2.0.3
```

Save your notebook as is. Note it will not run correctly until we address the issues with the code in the next chapter.

## 6.4 Exercise 2: Package management

Python version of the exercise

1. Download this starter conda environment.yml file by clicking on the link and place it in your example project files directory.
2. Navigate to your example project files directory using command line.
3. Create your conda environment by using this file in the command:

    conda env create --file environment.yml

4. Activate your conda environment using this command:

    conda activate reproducible-python

5. Now start up JupyterLab again using this command:

    jupyter lab

6. Follow these instructions to add the environment.yml file to the GitHub repository you created in the previous chapter.
Later we will practice and discuss how to more fully utilize the features of GitHub, but for now, just drag and drop it as the instructions linked describe.

### 6.4.1 More resources on how to use conda

- Install Jupyter using your own environment (Mac specific)
- Definitive guide to using conda

R version of the exercise

First install the renv package. Go to RStudio and the Console pane, and install renv using the following (you should only need to do this once per computer or RStudio environment):

```r
install.packages("renv")
```

Now set up renv to use in your project.

1. Change to your current directory for your project using setwd() in your console window (don't put this in a script or notebook).
2. Use this command in your project:

    ```r
    renv::init()
    ```

    This will start up renv in your particular project. (What's `::` about? In brief, it allows you to use a function from a package without loading the entire thing with `library()`.)
3. Now you can develop your project as you normally would, installing and removing packages in R as you see fit. For the purposes of this exercise, let's install the styler package using the following command. (The styler package will come in handy for styling our code in the next chapter.)

    ```r
    install.packages("styler")
    ```

4. Now that we have installed styler, we will want to add it to our renv snapshot. To add any packages we've installed to our renv snapshot, we will use this command:

    ```r
    renv::snapshot()
    ```

    This will save whatever packages we are currently using to our environment snapshot file called renv.lock. This renv.lock file is what we can share with our collaborators so they can replicate our computing environment.

If your package installation attempts are unsuccessful and you'd like to revert to the previous state of your environment, you can run `renv::restore()`. This will restore your environment to the state recorded in renv.lock before you attempted to install styler (or whatever packages you tried to install).

You should see a renv.lock file is now created or updated!
You will want to always include this file with your project files. This means we will want to add it to our GitHub! Follow these instructions to add your renv.lock file to the GitHub repository you created in the previous chapter. Later we will practice and discuss how to more fully utilize the features of GitHub, but for now, just drag and drop it as the instructions linked describe.

After you've added your computing environment files to your GitHub, you're ready to continue using them with your IDE to actually work on the code in your notebook! Any feedback you have regarding this exercise is greatly appreciated; you can fill out this form!

# Chapter 7 Writing durable code

## 7.1 Learning Objectives

## 7.2 General principles

#### 7.2.0.1 Work on your code iteratively

Getting your code to work the first time is the first step, but don't stop there! Just like in writing a manuscript you wouldn't consider your first draft a final draft, polishing your code works best in an iterative manner. Although you may need to set it aside for the day to give your brain a rest, return to your code later with fresh eyes and try to look for ways to improve upon it!

#### 7.2.0.2 Prioritize readability over cleverness

Some cleverness in code can be helpful; too much can make it difficult for others (including your future self!) to understand. If cleverness compromises the readability of your code, it probably is not worth it. Clever but unreadable code won't be re-used or trusted by others (AGAIN, including your future self!).

What does readable code look like?
Orosz (2019) has some thoughts on writing readable code:

> Readable code starts with code that you find easy to read. When you finish coding, take a break to clear your mind. Then try to re-read the code, putting yourself in the mindset that you know nothing about the changes and why you made them.
>
> Can you follow along with your code? Do the variables and method names help understand what they do? Are there comments at places where just the code is not enough? Is the style of the code consistent across the changes?
>
> Think about how you could make the code more readable. Perhaps you see some functions that do too many things and are too long. Perhaps you find that renaming a variable would make its purpose clearer. Make changes until you feel like the code is as expressive, concise, and pretty as it can be.
>
> The real test of readable code is others reading it. So get feedback from others, via code reviews. Ask people to share feedback on how clear the code is. Encourage people to ask questions if something does not make sense. Code reviews - especially thorough code reviews - are the best way to get feedback on how good and readable your code is.
>
> Readable code will attract little to no clarifying questions, and reviewers won't misunderstand it. So pay careful attention to the cases when you realize someone misunderstood the intent of what you wrote or asked a clarifying question. Every question or misunderstanding hints to opportunities to make the code more readable.
>
> A good way to get more feedback on the clarity of your code is to ask for feedback from someone who is not an expert on the codebase you are working on. Ask specifically for feedback on how easy to read your code is. Because this developer is not an expert on the codebase, they'll focus on how much they can follow your code. Most of the comments they make will be about your code's readability.

We'll talk a bit more about code review in an upcoming chapter!

More reading:

- Readable Code by Orosz (2019).
- Write clean R code by Dubel (2021).
- Python Clean Code: 6 Best Practices to Make Your Python Functions More Readable by Tran (2021).

#### 7.2.0.3 DRY up your code

DRY is an acronym: "Don't repeat yourself" (Smith 2013).

> "I hate code, and I want as little of it as possible in our product." -- Diederich (2012)

If you find yourself writing something more than once, you might want to write a function, or store something as a variable. The added benefit of writing a function is you might be able to borrow it in another project. DRY code is easier to fix and maintain because if it breaks, it's easier to fix something in one place than in 10 places. DRY code is easier on the reviewer because they don't have to review the same thing twice, but also because they don't have to review the same thing twice. ;)

DRYing code takes some iterative passes and edits, but in the end DRY code saves you and your collaborators time and can be something you reuse again in a future project!

Here's a slightly modified example from Bernardo (2021) of what DRY vs non-DRY code might look like:

```r
paste('Hello','John', 'welcome to this course')
paste('Hello','Susan', 'welcome to this course')
paste('Hello','Matt', 'welcome to this course')
paste('Hello','Anne', 'welcome to this course')
paste('Hello','Joe', 'welcome to this course')
paste('Hello','Tyson', 'welcome to this course')
paste('Hello','Julia', 'welcome to this course')
paste('Hello','Cathy', 'welcome to this course')
```

Could be functional-ized and rewritten as:

```r
GreetStudent <- function(name) {
  greeting <- paste('Hello', name, 'welcome to this course')
  return(greeting)
}
class_names <- c('John', 'Susan', 'Matt', 'Anne', 'Joe', 'Tyson', 'Julia', 'Cathy')
lapply(class_names, GreetStudent)
```

Now, if you wanted to edit the greeting, you'd only need to edit it in the function, instead of in each instance.

More reading about this idea:

- DRY Programming Practices by Klinefelter (2016).
- Keeping R Code DRY with functions by Riffomonas Project (2021).
- Write efficient R code for science by Max Joseph (2017).
- Write efficient Python code by Leah Wasser (2019).
- Don't repeat yourself: Python functions by Héroux (2018).

#### 7.2.0.4 Don't be afraid to delete and refresh a lot

Don't be afraid to delete it all and re-run (multiple times). This includes refreshing your kernel/session in your IDE. In essence, this is the data science version of "Have you tried turning it off and then on again?" Some bugs in your code exist or go unnoticed because old objects and libraries have overstayed their welcome in your environment.

Why do you need to refresh your kernel/session? As a quick example, let's suppose you are troubleshooting something that centers around an object named some_obj, but then you rename this object to iris_df. When you rename this object, you may need to update other places in the code. If you don't refresh your environment while working on your code, some_obj will still be in your environment. This will make it more difficult for you to find where else the code needs to be updated. Refreshing your kernel/session goes beyond objects defined in your environment; it can also affect packages and dependencies loaded and all kinds of other things attached to your kernel/session.

As a quick experiment, try this in your Python or R environment. The dir() and ls() functions list your defined variables in your Python and R environments respectively.

In Python:

```python
some_obj=[]
dir()
```

Now refresh your kernel and re-run dir():

```python
dir()
```

You should see you no longer have some_obj listed as being defined in your environment.

In R:

```r
some_obj <- c()
ls()
```

Now refresh your session and re-run ls():

```r
ls()
```

You should see you no longer have some_obj listed as being defined in your environment.

Keeping around old code and objects is generally more of a hindrance than a time saver.
Sometimes it can be easy to get very attached to a chunk of code that took you a long time to troubleshoot, but there are three reasons you don't need to stress about deleting it:

1) You might write better code on the second try (or third or n'th).
2) Keeping around old code makes it harder for you to write and troubleshoot new better code -- it's easier to confuse yourself. Sometimes a fresh start can be what you need.
3) With version control you can always return to that old code! (We'll dive more into version control later on, but you've started the process by uploading your code to GitHub in chapter 4!)

This means you should not comment out old code. Just delete it! No code is so precious that you need to keep it commented out (particularly if you are using version control and you can retrieve it in other ways should you need it).

Related to this, if you want to be certain that your code is reproducible, it's worth deleting all your output and re-running everything with a fresh session. The first step to knowing if your analysis is reproducible is seeing if you can repeat it yourself!

#### 7.2.0.5 Use code comments effectively

Good code comments are a part of writing good, readable code! Your code is more likely to stand the test of time if others, including yourself in the future, can see what's happening well enough to trust it themselves. This will encourage others to use your code and help you maintain it! 'Current You' who is writing your code may know what is happening, but 'Future You' will have no idea what 'Current You' was thinking (Spielman, n.d.):

> 'Future You' comes into existence about one second after you write code, and has no idea what on earth Past You was thinking. Help out 'Future You' by adding lots of comments! 'Future You' next week thinks Today You is an idiot, and the only way you can convince 'Future You' that Today You is reasonably competent is by adding comments in your code explaining why Today You is actually not so bad.
Your code and your understanding of it will fade soon after you write it, leaving your hard work to deprecate. Code that works is a start, but readable AND working code is best! Comments can help clarify points where your code might need further explanation. The best code comments explain the why of what you are doing. The act of writing them can also help you think through your thought process and perhaps identify a better solution for the odd parts of your code. (From Savonen (2021a))

More reading about creating clarifying code comments:

- Best Practices for Writing Code Comments by Spertus (2021).
- What Makes a Good Code Comment? by Cronin (2019).
- The Value of Code Documentation by Meza (2018).
- Some internet wisdom on R documentation by Frazee (2014).
- How to Comment Your Code Like a Pro: Best Practices and Good Habits by Keeton (2019).

#### 7.2.0.6 Use informative variable names

Try to avoid using variable names that have no meaning, like tmp, x, or i. Meaningful variable names make your code more readable! Additionally, variable names that are longer than one letter are much easier to search and replace if needed. One-letter variables are hard to replace and hard to read. Don't be afraid of long variable names; they are very unlikely to be confused!

1) Write intention-revealing names.
2) Use consistent notation for naming convention.
3) Use standard terms.
4) Do not number a variable name.
5) When you find another way to name a variable, refactor as fast as possible.

(Hobert 2018)

More reading:

- R for Epidemiology - Coding best Practices by Cannell (2021).
- Data Scientists: Your Variable Names Are Awful. Here's How to Fix Them by Koehrsen (2019).
- Writing Variable — Informative, Descriptive & Elegant by Hobert (2018).

#### 7.2.0.7 Follow a code style

Just like writing that doesN"t FoLLOW conv3nTi0Ns OR_sPAcinng 0r sp3llinG can be distracting, the same goes for code.
Your code may even work all the same, just like you understood what I wrote in that last sentence, but a lack of consistent style can require more brain power from your readers to understand it. For reproducibility purposes, readability is important! The easier you can make it on your readers, the more likely they will be able to understand and reproduce the results.

There are different style guides out there that people adhere to. It doesn't matter so much which one you choose, just that you pick one and stick to it for a particular project.

Python style guides:

- PEP8 style guide "PEP 8 -- Style Guide for Python Code" (2021).
- Google Python style guide "Styleguide" (2021).

R style guides:

- Hadley Wickham's Style guide Wickham (2019).
- Google R style guide "Google's R Style Guide" (2021).

Although writing code following a style as you go is good practice, we're all human and that can be tricky to do, so we recommend using an automatic styler to fix up your code for you. For Python code, you can use python black and for R, styler.

#### 7.2.0.8 Organize the structure of your code

Readable code should follow an organized structure. Just like how outlines help the structure of manuscript writing, outlines can also help the organization of code writing. A tentative outline for a notebook might look like this:

1) A description of the purpose of the code (in Markdown).
2) Import the libraries you will need (including sourcing any custom functions).
3) List any hard-coded variables.
4) Import data.
5) Do any data cleaning needed.
6) The main thing you need to do.
7) Print out session info.

Note that if your notebook gets too long, you may want to separate things out into their own scripts. Additionally, it's good practice to keep custom functions in their own file and import them. This allows you to use them elsewhere and also keeps the main part of the analysis cleaner.
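The outline above can be sketched as a tiny, self-contained script. (This is purely an illustration: the toy data, the `variance` helper, and the `N_TOP_ROWS` cutoff are hypothetical placeholders, not the course's actual make-heatmap code.)

```python
# 1) Purpose: filter a small expression-like table to its most variable rows.

# 2) Imports (custom functions would be sourced from their own file here)
import random

# 3) Hard-coded variables, declared once up front with their rationale
random.seed(1234)   # set the seed up front (important when later steps are random)
N_TOP_ROWS = 2      # placeholder cutoff; a real analysis would justify this number

# 4) Import data (an in-memory stand-in for reading a TSV file)
rows = {
    "gene_a": [1.0, 1.1, 0.9],
    "gene_b": [5.0, 1.0, 9.0],
    "gene_c": [2.0, 2.0, 2.0],
}

# 5) Data cleaning would go here

# 6) The main thing: keep the rows with the largest sample variance
def variance(values):
    """Sample variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)

top = sorted(rows, key=lambda name: variance(rows[name]), reverse=True)[:N_TOP_ROWS]
print(top)  # ['gene_b', 'gene_a']

# 7) Session info would be printed here (e.g. sys.version, package versions)
```

Each numbered comment maps to one outline step, so a reader can skim the structure before reading any logic; in a notebook, each step would be its own chunk with a Markdown description above it.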
#### 7.2.0.9 Set the seed if your analysis has randomness involved

If any randomness is involved in your analysis, you will want to set the seed in order for your results to be reproducible. In brief, computers don't actually create numbers randomly; they create numbers pseudorandomly. If you want your results to be reproducible, you should give your computer a seed by which to create random numbers. This will allow anyone who re-runs your analysis to have a positive control and eliminate randomness as a reason the results were not reproducible.

For more on how setting the seed works, we'll run a quick experiment here. First let's set a seed (it doesn't matter what number we use, just that we pick a number), so let's use 1234, and then create a "random" number.

```r
# Set the seed:
set.seed(1234)

# Now create a random number
runif(1)
```

```
## [1] 0.1137034
```

Now if we try a different seed, we will get a different "random" number.

```r
# Set a different seed:
set.seed(4321)

# Now create a random number again
runif(1)
```

```
## [1] 0.334778
```

But, if we return to the original seed we used, 1234, we will get the original "random" number we got.

```r
# Set this back to the original seed
set.seed(1234)

# Now we'll get the same "random" number we got when we set the seed to 1234 previously
runif(1)
```

```
## [1] 0.1137034
```

More reading:

- Set seed by Soage (2020).
- Generating random numbers by Chang (2021).

#### 7.2.0.10 To review general principles:

## 7.3 More reading on best coding practices

There are so many opinions and strategies on best practices for code. And although a lot of these principles are generally applicable, not all of it is one size fits all. Some code practices are context-specific, so sometimes you may need to pick and choose what works for you, your team, and your particular project.

#### 7.3.0.1 Python specific:

- Reproducible Programming for Biologists Who Code Part 2: Should Dos by Heil (2020).
- 15 common coding mistakes data scientist make in Python (and how to fix them) by Csendes (2020).
- Data Science in Production — Advanced Python Best Practices by Kostyuk (2020).
- 6 Mistakes Every Python Beginner Should Avoid While Coding by Saxena (2021).

#### 7.3.0.2 R specific:

- Data Carpentry's: Best Practices for Writing R Code by "Best Practices for Writing R Code -- Programming with R" (2021).
- R Programming for Research: Reproducible Research by Good (2021).
- R for Epidemiology: Coding best practices by Cannell (2021).
- Best practices for R Programming by Bernardo (2021).

## 7.4 Get the exercise project files (or continue with the files you used in the previous chapter)

Get the Python project example files

1. Click this link to download.
2. Now double click your chapter zip file to unzip. (For Windows you may have to follow these instructions.)

Get the R project example files

1. Click this link to download.
2. Now double click your chapter zip file to unzip. (For Windows you may have to follow these instructions.)

## 7.5 Exercise 1: Make code more durable!

### 7.5.1 Organize the big picture of the code

Before diving in line-by-line, it can be helpful to make a code outline of sorts. What are the main steps you need to accomplish in this notebook? What are the starting and ending points for this particular notebook?

For example, for this make-heatmap notebook we want to:

1. Set up analysis folders and declare file names.
2. Install the libraries we need.
3. Import the gene expression data and metadata.
4. Filter down the gene expression data to genes of interest -- in this instance the most variant ones.
5. Clean the metadata.
6. Create an annotated heatmap.
7. Save the heatmap to a PNG.
8. Print out the session info!

Python version of the exercise

The exercise: Polishing code

1. Start up JupyterLab by running jupyter lab from your command line.
2. Activate your conda environment using conda activate reproducible-python.
3. Open up the notebook you made in the previous chapter, make-heatmap.ipynb.
4. Work on organizing the code chunks and adding documentation to reflect the steps we've laid out in the previous section. You may want to work on this iteratively as we dive into the code.

As you clean up the code, you should run and re-run chunks to see if they work as you expect. You will also want to refresh your environment to help you develop the code (sometimes older objects stuck in your environment can inhibit your ability to troubleshoot). In Jupyter, you refresh your environment by using the refresh icon in the toolbar or by going to Restart Kernel.

Set the seed

Rationale: The clustering in the analysis involves some randomness. We need to set the seed!

Before: Nothing! We didn't set the seed before!

After: You can pick any number; it doesn't have to be 1234.

```python
random.seed(1234)
```

Use a relative file path

Rationale: Absolute file paths only work for the original writer of the code and no one else. But if we make the file path relative to the project set up, then it will work for whomever has the project repository (Mustafeez 2021). Additionally, we can set up our file path names using f-strings so that we only need to change the project ID and the rest will be ready for a new dataset (Python 2021)! Although this requires more lines of code, this set up is much more flexible and ready for others to use.
Before:

```python
df1=pd.read_csv('~/a/file/path/only/I/have/SRP070849.tsv', sep='\t')
mdf=pd.read_csv('~/a/file/path/only/I/have/SRP070849_metadata.tsv', sep='\t')
```

After:

```python
# Declare project ID
id = "SRP070849"

# Define the file path to the data directory
data_dir = Path(f"data/{id}")

# Declare the file path to the gene expression matrix file
data_file = data_dir.joinpath(f"{id}.tsv")

# Declare the file path to the metadata file
# inside the directory saved as `data_dir`
metadata_file = data_dir.joinpath(f"metadata_{id}.tsv")

# Read in metadata TSV file
metadata = pd.read_csv(metadata_file, sep="\t")

# Read in data TSV file
expression_df = pd.read_csv(data_file, sep="\t")
```

Related readings:

- f-strings in Python by Geeks (2018).
- f-Strings: A New and Improved Way to Format Strings in Python by Python (2021).
- Relative vs absolute file paths by Mustafeez (2021).
- About join path by "Python Examples of Pathlib.Path.joinpath" (2021).

Avoid using mystery numbers

Rationale: Avoid using numbers that don't have context around them in the code. Include the calculations for the number, or if it needs to be hard-coded, explain the rationale for that number in the comments. Additionally, using variable and column names that tell you what is happening helps clarify what the number represents.

Before:

```python
df1['calc'] =df1.var(axis = 1, skipna = True)
df2=df1[df1.calc >float(10)]
```

After:

```python
# Calculate the variance for each gene
expression_df["variance"] = expression_df.var(axis=1, skipna=True)

# Find the upper quartile for these data
upper_quartile = expression_df["variance"].quantile([0.90]).values

# Filter the data choosing only genes whose variances are in the upper quartile
df_by_var = expression_df[expression_df.variance > float(upper_quartile)]
```

Related readings:

- Stop Using Magic Numbers and Variables in Your Code by Aaberge (2021).

Add checks

Rationale: Just because your script ran without an error that stopped the script doesn't mean it is accurate and error free.
Silent errors are the trickiest to solve, because you often won't know that they happened! A very common error is data that is in the wrong order. In this example we have two data frames that contain information about the same samples. But in the original script, we don't ever check that the samples are in the same order in the metadata and the gene expression matrix! This is a really easy way to get incorrect results!

Before: Nothing, we didn't check for this before.

After:

```python
print(metadata["refinebio_accession_code"].tolist() == expression_df.columns.tolist())
```

Continue to try to apply the general advice we gave about code to your notebook! Then, when you are ready, take a look at what our "final" version looks like in the example Python repository. ("Final" here is in quotes because we may continue to make improvements to this notebook too -- remember what we said about iterative?)

R version of the exercise

About the tidyverse: Before we dive into the exercise, a word about the tidyverse. The tidyverse is a highly useful set of packages for creating readable and reproducible data science workflows in R. In general, we will opt for tidyverse approaches in this course, and strongly encourage you to familiarize yourself with the tidyverse if you have not. We will point out some instances where tidyverse functions can help you DRY up your code as well as make it more readable!

More reading on the tidyverse:

- Tidyverse Skills for Data Science by Carrie Wright (n.d.).
- A Beginner's Guide to Tidyverse by A. V. Team (2019).
- Introduction to tidyverse by Shapiro et al. (2021).

The exercise: Polishing code

1. Open up RStudio.
2. Open up the notebook you created in the previous chapter.
3. Now we'll work on applying the principles from this chapter to the code. We'll cover some of the points here, but then we encourage you to dig into the fully transformed notebook we will link at the end of this section.
Work on organizing the code chunks and adding documentation to reflect the steps we've laid out in the previous section. You may want to work on this iteratively as we dive into the code. As you clean up the code, you should run and re-run chunks to see if they work as you expect. You will also want to refresh your environment as you develop the code (sometimes older objects stuck in your environment can inhibit your ability to troubleshoot). In RStudio, you can refresh your environment by going to the Run menu and using Restart R and Clear Output.

#### Set the seed

Rationale: The clustering in this analysis involves some randomness. We need to set the seed!

Before: Nothing! We didn't set the seed before!

After:

```r
# You can pick any number; it doesn't have to be 1234
set.seed(1234)
```

#### Get rid of setwd()

Rationale: setwd() almost never works for anyone besides the one person who wrote it, and in a few days or weeks it may not work for them either.

Before:

```r
setwd("Super specific/filepath/that/noone/else/has/")
```

After: Now that we are working from a notebook, we know that the default current directory is wherever the notebook is placed (Xie, Dervieux, and Riederer 2020).

Related readings:

- Jenny Bryan will light your computer on fire if you use setwd() in a script (Bryan 2017).

#### Give the variables more informative names

Rationale: `xx` doesn't tell us what is in the data here. Also, by using readr::read_tsv() from the tidyverse we get a cleaner, faster read and won't have to specify the sep argument. Note we are also fixing some spacing and using <- so that we can stick to readability conventions.

Before:

```r
xx=read.csv("metadata_SRP070849.tsv", sep = "\t")
```

After:

```r
metadata <- readr::read_tsv("metadata_SRP070849.tsv")
```

Related readings:

- readr::read_tsv() documentation by "Read a Delimited File (Including CSV and TSV) into a Tibble — Read_delim" (n.d.).

#### DRYing up data frame manipulations

Rationale: It can be very tricky to tell what this chunk of code is doing.
What is happening with df1 and df2? What's being filtered out? Code comments would certainly help understanding, but even better, we can DRY this code up and make it clearer on its own.

Before: It may be difficult to tell by looking at the before code (there are no comments and it's a bit tricky to read), but the goal of this is to:

1. Calculate variances for each row (each row is a gene).
2. Filter the original gene expression matrix to only genes that have a bigger variance (here an arbitrary cutoff of 10 is used).

```r
df=read.csv("SRP070849.tsv", sep="\t")
sums=matrix(nrow = nrow(df), ncol = ncol(df) - 1)
for(i in 1:nrow(sums)) {
  sums[i, ] <- sum(df[i, -1])
}
df2=df[which(df[, -1] >= 10), ]
variances=matrix(nrow = nrow(dds), ncol = ncol(dds) - 1)
for(i in 1:nrow(dds)) {
  variances[i, ] <- var(dds[i, -1])
}
```

After: Let's see how we can do this in a DRYer and clearer way. We can:

1) Add comments to describe our goals.
2) Use variable names that are more informative.
3) Use the apply functions to do the loop for us; this also eliminates the need for the unclear variable i.
4) Use the tidyverse to do the filtering for us so we don't have to rename data frames or store extra versions of df.

Here's what the above might look like after some refactoring. Hopefully you find it easier to follow, and in total there are fewer lines of code (but with comments too!).
```r
# Read in data TSV file
expression_df <- readr::read_tsv(data_file) %>%
  # Here we are going to store the gene IDs as row names so that
  # we can have only numeric values to perform calculations on later
  tibble::column_to_rownames("Gene")

# Calculate the variance for each gene
variances <- apply(expression_df, 1, var)

# Determine the upper quartile variance cutoff value
upper_var <- quantile(variances, 0.75)

# Filter the data choosing only genes whose variances are in the upper quartile
df_by_var <- data.frame(expression_df) %>%
  dplyr::filter(variances > upper_var)
```

#### Add checks

Rationale: Just because your script ran without an error that stopped the script doesn't mean it is accurate and error free. Silent errors are the trickiest to solve, because you often won't know that they happened! A very common error is data that is in the wrong order. In this example, we have two data frames that contain information about the same samples, but in the original script we never check that the samples are in the same order in the metadata and the gene expression matrix! This is a really easy way to get incorrect results!

Before: Nothing... we didn't check for this :(

After:

```r
# Make the data in the order of the metadata
expression_df <- expression_df %>%
  dplyr::select(metadata$refinebio_accession_code)

# Check if this is in the same order
all.equal(colnames(expression_df), metadata$refinebio_accession_code)
```

Continue to try to apply the general advice we gave about code to your notebook! Then, when you are ready, take a look at what our "final" version looks like in the example R repository. ("Final" here is in quotes because we may continue to make improvements to this notebook too; remember what we said about iterative?)

Now that we've made some nice updates to the code, we are ready to do a bit more polishing by adding more documentation! But before we head to the next chapter, we can style the code we wrote automatically by using automatic code stylers!
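Before we move on to styling, the "add checks" idea from both the Python and R versions above can be distilled into a tiny, dependency-free sketch. The sample IDs and values below are made up for illustration (they are not real refine.bio accessions):

```python
# Sample IDs in the order the metadata lists them (made-up IDs for illustration)
metadata_samples = ["SRR001", "SRR002", "SRR003"]

# A toy "expression matrix" whose sample columns arrived in a different order
expression = {
    "SRR002": [5.0, 2.0],
    "SRR001": [1.0, 9.0],
    "SRR003": [3.0, 4.0],
}

# The check: are the samples in the same order? Here it prints False,
# which is exactly the kind of silent error that never raises an exception.
print(list(expression) == metadata_samples)

# The fix: reorder the expression columns to match the metadata order
expression = {sample: expression[sample] for sample in metadata_samples}

# Check again, and fail loudly if anything still disagrees
assert list(expression) == metadata_samples
print(list(expression) == metadata_samples)
```

The `assert` at the end is the important habit: unlike a `print()`, it stops the analysis if the orders ever disagree, so the error can't stay silent.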
## 7.6 Exercise 2: Style code automatically!

### Styling Python code automatically

Run your notebook through black. First you'll need to install it by running this command in a Terminal window in your JupyterLab. Make sure you are running this within your conda environment:

```
conda activate reproducible-python
```

Now install black:

```
pip install black[jupyter]
```

To record your conda environment, run this command:

```
conda env export > environment-record.yml
```

Now you can automatically style your code by running this command from your Terminal (be sure to replace make-heatmap.ipynb with whatever you have named your notebook):

```
python -m black make-heatmap.ipynb
```

You should get a message that your notebook was styled!

### Styling R code automatically

Let's run your notebook through styler. First you'll need to install it and add it to your renv:

```r
install.packages("styler")
```

Then add it to your renv by running:

```r
renv::snapshot()
```

Now you can automatically style your code by running this command from your Console (be sure to replace make-heatmap.Rmd with whatever you have named your notebook):

```r
styler::style_file("make-heatmap.Rmd")
```

You should get a message that your notebook was styled!

Before you are done with this exercise, there's one more thing we need to do: upload the latest version to GitHub. Follow these instructions to add the latest version of your notebook to your GitHub repository. Later, we will practice and discuss how to more fully utilize the features of GitHub, but for now, just drag and drop it as the linked instructions describe.

Any feedback you have regarding this exercise is greatly appreciated; you can fill out this form!
# Chapter 8 Documenting analyses

## 8.1 Learning Objectives

## 8.2 Why documentation?

Documentation is an important but sometimes overlooked part of creating a reproducible analysis! There are two parts of documentation we will discuss here: 1) notebook descriptions and 2) READMEs.

Both notebook descriptions and READMEs are written in markdown, a shorthand for HTML (the same as the documentation parts of your code). If you aren't familiar, markdown is such a handy tool and we encourage you to learn it (it doesn't take too long); here's a quick guide to get you started.

### 8.2.1 Notebook descriptions

As we discussed in chapter 5, data analyses can lead one on a winding trail of decisions, but notebooks allow you to narrate your thought process as you travel along these analysis explorations! Your scientific notebook should include descriptions of:

#### 8.2.1.1 The purposes of the notebook

- What scientific question are you trying to answer?
- Describe the dataset you are using to try to answer this, and why it helps answer the question.

#### 8.2.1.2 The rationales behind your decisions

- Describe why a particular code chunk is doing a particular thing; the odder the code looks, the greater the need to describe why you are doing it.
- Describe any particular filters or cutoffs you are using and how you decided on those.
- For data wrangling steps, why are you wrangling the data in such a way? Is this because a certain package you are using requires it?

#### 8.2.1.3 Your observations of the results

- What do you think about the results?
- How do the plots and tables you show in the notebook inform your original questions?

### 8.2.2 READMEs!

READMEs are also a great way to help your collaborators get quickly acquainted with the project.
READMEs stick out in a project and are a generally universal signal for people new to the project to start by READing them. GitHub automatically previews your file called README.md when someone comes to the main page of your repository, which further encourages people looking at your project to read the information in your README.

Information that should be included in a README:

- The general purpose of the project
- Instructions on how to re-run the project
- A list of any software required by the project
- Input and output file descriptions
- Descriptions of any additional tools included in the project

You can take a look at this template README to get you started.

#### 8.2.2.1 More about writing READMEs:

- How to write a good README file
- A Beginners Guide to writing a Kicka** README
- How to write an awesome README

## 8.3 Get the exercise project files (or continue with the files you used in the previous chapter)

Get the Python project example files: Click this link to download. Now double click your chapter zip file to unzip. (For Windows, you may have to follow these instructions.)

Get the R project example files: Click this link to download. Now double click your chapter zip file to unzip. (For Windows, you may have to follow these instructions.)

## 8.4 Exercise 1: Practice beefing up your notebook descriptions

### Python project exercise

1. Start up JupyterLab by running jupyter lab from your command line.
2. Activate your conda environment using conda activate reproducible-python.
3. Open up the notebook you've been working on in the previous chapters: make_heatmap.ipynb.
4. Create a new chunk in your notebook and choose the "Markdown" option in the dropdown menu.
5. Continue to add more descriptions where you feel they are necessary. You can reference the descriptions we have in the "final" version in the example Python repository.
(Again, "final" is in quotes because we may continue to make improvements to this notebook too; remember what we said about iterative?)

### R project exercise

1. Open up RStudio.
2. Open up the notebook you've been working on in the previous chapters: make_heatmap.Rmd.
3. In between code chunks, add more descriptions using Markdown language. You can test how this renders by saving your .Rmd, opening up the resulting nb.html file, and choosing View in Browser.
4. Continue to add more descriptions where you feel they are necessary. You can reference the descriptions we have in the "final" version in the example R repository. (Again, "final" is in quotes because we may continue to make improvements to this notebook too.)

## 8.5 Exercise 2: Write a README for your project!

1. Download this template README.
2. Fill in the questions inside the { } to create a README for this project. You can reference the "final" versions of the README, but keep in mind they reference items that we will discuss in the "advanced" portion of this course. See the R README here and the Python README here.
3. Add your README and updated notebook to your GitHub repository. Follow these instructions to add the latest version of your notebook to your GitHub repository. Later, we will practice and discuss how to more fully utilize the features of GitHub, but for now, just drag and drop them as the linked instructions describe.

Any feedback you have regarding this exercise is greatly appreciated; you can fill out this form!

# Chapter 9 Code review

## 9.1 Learning Objectives

We've previously discussed that the only way to know if your analysis is truly reproducible is to send it to someone else to reproduce! That sentiment is at the heart of code review.
Although most of us wouldn't dare send out a manuscript for publication without our collaborators giving it a line-by-line review, people don't always feel the same way about code. Parker (2017) describes code review:

> Code review will not guarantee an accurate analysis, but it's one of the most reliable ways of establishing one that is more accurate than before.

Not only does code review help boost the accuracy and reproducibility of the analysis, it also helps everyone involved in the process learn something new!

#### 9.1.0.1 Recommended reading about code review

- Code Review Guidelines for Humans by Hauer (2018).
- Your Code Sucks! – Code Review Best Practices by Hildebr (2020).
- Best practices for Code Review by S. Team (2021).
- Why code reviews matter (and actually save time!) by Radigan (2021).

## 9.2 Exercise: Set up your code review request!

Since reproducibility is all about someone else being able to run your code and obtain your results, the exercise in this course involves preparing to do just that!

The goal: In the second part of this reproducibility course we will discuss how to conduct formal line-by-line code reviews, but for now we will discuss how to prep your analysis for someone else to look at your code and attempt to run it.

At this point, you should have a GitHub repository that contains the following:

- A make_heatmap notebook
- A README
- A data folder containing the metadata and gene expression matrix files in a folder named SRP070849:
  - SRP070849/metadata_SRP070849.tsv
  - SRP070849/SRP070849.tsv

1) Refresh and delete output

Before you send off your code to someone else, delete your output (the results and plots folders) and attempt to re-run it yourself. This also involves restarting your R session/Python kernel and running all the chunks again.

2) Re-run the whole analysis

3) Interrogate and troubleshoot

If your code has any issues running, try your best to troubleshoot the problems. Read this handy guide for tips on troubleshooting R.
4) Rinse and repeat

Repeat this as many times as needed until you are reliably able to re-run this code and get the same results without any code smells popping up. Dig into bad code smells or bad results smells wherever you sense them. If you aren't sure why you feel this way about your code or results, hold on to that feeling; your collaborator may be able to see something you don't.

5) Let it simmer

Leave your analysis for a bit. Do you think it's perfect? Are you at your wits' end with it? No matter how you feel about it, let it sit for half a day or so and then return to it with fresh eyes (Savonen 2021b).

6) Re-review your documentation and code with fresh eyes

Now, with fresh eyes and doing your best to imagine you don't have the knowledge you have, do your analysis and results make sense?

7) Are you sure it's ready?

Ask yourself whether you've polished this code and documentation as far as you can reasonably take it. Realizing what qualifies as "as far as you can reasonably take it" is also a skill you will build with time. Code review is the most efficient use of everyone's time when your code and documentation have reached this point.

8) Draft your request

Now you are ready to send this code to your collaborator, but first try to send them a specific set of instructions and questions about what you would like them to review. Include this information in your message to them (you may want to draft it in a scratch file). Code review requests should include:

- A link to your repository, with your README, to get them quickly oriented to the project.
- A request for what kind of feedback you are looking for. Big picture? Technical? Method selection?
- Are there specific areas of the code you are having trouble with or are unsure about? Send a link to the specific lines in GitHub you are asking about.
- Are there results that are surprising, confusing, or smell wrong?
- Be sure to detail what you have dug into and tried at this point for any problematic points.
- Explicitly tell them what commands or tests you'd like them to run.
- Lastly, thank them for helping review your code!

9) Ready for review

Now you are ready to send your crafted message to your collaborator for review. For the purposes of this exercise, you may not want to ask your collaborator to spend their time carefully reviewing this practice repository, but now that you understand and have done the steps involved, you are prepared to do this for your own analyses!

Any feedback you have regarding this exercise is greatly appreciated; you can fill out this form! In the second part of this course, we will discuss how to conduct code review through GitHub, further utilize version control, and more!

# About the Authors

These credits are based on our course contributors table guidelines.

- Pedagogy
  - Lead Content Instructor: Candace Savonen
  - Lecturer: Candace Savonen
  - Content Directors: Jeff Leek, Sarah Wheelan
  - Content Consultant: David Swiderski
  - Acknowledgments: Patrick O'Connell
- Production
  - Content Publisher: Ira Gooding
  - Content Publishing Reviewer: Ira Gooding
- Technical
  - Course Publishing Engineer: Candace Savonen
  - Template Publishing Engineers: Candace Savonen, Carrie Wright
  - Publishing Maintenance Engineer: Candace Savonen
  - Technical Publishing Stylists: Carrie Wright, Candace Savonen
  - Package Developers (ottrpal): John Muschelli, Candace Savonen, Carrie Wright
- Art and Design
  - Illustrator: Candace Savonen
  - Figure Artist: Candace Savonen
  - Videographer: Candace Savonen
  - Videography Editor: Candace Savonen
- Funding
  - Funder: National Cancer Institute (NCI) UE5 CA254170
  - Funding Staff: Emily Voeglein, Fallon Bachman
# Intro to Reproducibility in Cancer Informatics

March, 2024

## About this Course

This course is part of a series of courses for the Informatics Technology for Cancer Research (ITCR) called the Informatics Technology for Cancer Research Education Resource. This material was created by the ITCR Training Network (ITN), which is a collaborative effort of researchers around the United States to support cancer informatics and data science training through resources, technology, and events. This initiative is funded by the following grant: National Cancer Institute (NCI) UE5 CA254170. Our courses feature tools developed by ITCR Investigators and make it easier for principal investigators, scientists, and analysts to integrate cancer informatics into their workflows. Please see our website at www.itcrtraining.org for more information.

## 0.1 Available course formats

This course is available in multiple formats, which allows you to take it in the way that best suits your needs. You can take it for a certificate, either for free or for a fee.

- The material for this course can be viewed without login requirements on this Bookdown website. This format might be most appropriate for you if you rely on screen-reader technology.
- This course can be taken for free certification through Leanpub.
- This course can be taken on Coursera for certification here (but it is not available for free on Coursera).
- Our courses are open source; you can find the source material for this course on GitHub.

# Chapter 1 Introduction

## 1.1 Target Audience

The course is intended for students in the biomedical sciences and researchers who use informatics tools in their research and have not had training in reproducibility tools and methods.
This course is written for individuals who:

- Have some familiarity with R or Python (have written some scripts).
- Have not had formal training in computational methods.
- Have limited or no familiarity with GitHub, Docker, or package management tools.

## 1.2 Topics covered

This is a two-part series.

## 1.3 Motivation

Cancer datasets are plentiful, complicated, and hold untold amounts of information regarding cancer biology. Cancer researchers are working to apply their expertise to the analysis of these vast amounts of data, but training opportunities to properly equip them in these efforts can be sparse. This includes training in reproducible data analysis methods.

Data analyses are generally not reproducible without direct contact with the original researchers and a substantial amount of time and effort (Beaulieu-Jones and Greene 2017). Reproducibility in cancer informatics (as with other fields) is still not monitored or incentivized, despite being fundamental to the scientific method. Despite the lack of incentive, many researchers strive for reproducibility in their own work but often lack the skills or training to do so effectively.

Equipping researchers with the skills to create reproducible data analyses increases the efficiency of everyone involved. Reproducible analyses are more likely to be understood, applied, and replicated by others. This helps expedite the scientific process by helping researchers avoid false positive dead ends. Open source clarity in reproducible methods also saves researchers' time so they don't have to reinvent the proverbial wheel for methods that everyone in the field is already performing.

## 1.4 Curriculum

This course introduces the concepts of reproducibility and replicability in the context of cancer informatics. It uses hands-on exercises to demonstrate in practical terms how to increase the reproducibility of data analyses.
The course also introduces tools relevant to reproducibility, including analysis notebooks, package managers, Git, and GitHub. The course includes hands-on exercises showing how to apply reproducible code concepts to your code. Individuals who take this course are encouraged to complete these activities as they follow along with the course material to help increase the reproducibility of their analyses.

Goal of this course: Equip learners with reproducibility skills they can apply to their existing analysis scripts and projects. This course opts for an "ease into it" approach: we attempt to give learners doable, incremental steps to increase the reproducibility of their analyses.

What is not the goal: This course is meant to introduce learners to reproducibility tools, but it does not necessarily represent the absolute end-all, be-all best practices for the use of these tools. In other words, this course gives a starting point with these tools, but not an ending point. The advanced version of this course is the next step toward incrementally "better practices".

## 1.5 How to use the course

This course is designed with busy professional learners in mind, who may have to pick up and put down the course when their schedule allows. Each exercise gives you the option to continue along with the example files as you've been editing them in each chapter, OR to download fresh chapter files that have been edited in accordance with the relevant part of the course. This way, if you decide to skip a chapter or find that the files you've been working on no longer make sense, you have a fresh starting point at each exercise.
# Chapter 2 Defining reproducibility

## 2.1 Learning Objectives

## 2.2 What is reproducibility?

There's been a lot of discussion about what is included in the term reproducibility, and there is some discrepancy between fields. For the purposes of informatics and data analysis, a reproducible analysis is one that can be re-run by a different researcher such that the same result and conclusion is found. Reproducibility is related to repeatability and replicability, but it is worth taking time to differentiate these terms.

Perhaps you are like Ruby and have just found an interesting pattern through your data analysis! This has probably been the result of many months or years of work on your project, and it's worth celebrating! But before she considers these results a done deal, Ruby should test whether she is able to re-run her own analysis and get the same results again. This is known as repeatability.

Given that Ruby's analysis is repeatable, she may now feel confident enough to share her preliminary results with her colleague, Avi the Associate. Whether or not someone else will be able to take Ruby's code and data, re-run the analysis, and obtain the same results is known as reproducibility.

If Ruby's results are able to be reproduced by Avi, Avi may now collect new data and use Ruby's same analysis methods to analyze it. Whether or not Avi's new data and results concur with Ruby's study's original inferences is known as replicability.

You may realize that these levels of research build on each other (like science is supposed to do). In this way, we can think of them as a hierarchy. Skipping any of these levels of research applicability can lead to unreliable results and conclusions. Science progresses when data and hypotheses are put through these levels thoroughly and sequentially. If results are not repeatable, they won't be reproducible or replicable.
Ideally, all analyses and results would be reproducible without too much time and effort spent; this would aid the efficiency of research getting to the next stages and questions. But unfortunately, in practice, reproducibility is not as commonplace as we would hope. Institutions and reward systems generally do not prioritize or even measure reproducibility standards in research, and training opportunities for reproducible techniques can be scarce. Reproducible research can often feel like an uphill battle that is made steeper by a lack of training opportunities. In this course, we hope to equip you with the tools you need to enhance the reproducibility of your analyses so this uphill battle is less steep.

## 2.3 Reproducibility in daily life

What does reproducibility mean in the daily life of a researcher? Let's say Ruby's results are repeatable in her own hands and she excitedly tells her associate, Avi, about her preliminary findings. Avi is very excited about these results and is interested in Ruby's analysis methods, so Ruby sends Avi the code and data she used to obtain the results. Now, whether or not Avi is able to obtain the same exact results with this same data and same analysis code will indicate whether Ruby's analysis is reproducible.

Ruby may have spent a lot of time on her code and getting it to work on her computer, but whether it will successfully run on Avi's computer is another story. Often, when researchers share their analysis code, the researcher who receives it must spend a substantial amount of effort to get it working, and this often cannot be done successfully without help from the original code author (Beaulieu-Jones and Greene 2017). Avi is encountering errors because Ruby's code was written with Ruby's computer and local setup in mind, and she didn't know how to make it more generally applicable.
Avi is spending a lot of time just trying to re-run Ruby's analysis on her same data; he has yet to be able to try the code on any additional data (which will likely bring up even more errors). Avi is still struggling to work with Ruby's code and is confused about the goals and approaches the code is taking. After struggling with Ruby's code for an untold amount of time, Avi may decide it's time to email Ruby to get some clarity.

Now both Avi and Ruby are confused about why this analysis isn't nicely re-running for Avi. Their attempts to communicate about the code through email haven't helped them clarify anything. Multiple versions of the code may have been sent back and forth between them, and now things are taking a lot more time than either of them expected.

Perhaps at some point Avi is able to successfully run Ruby's code on Ruby's same data. But just because Avi didn't get any errors doesn't mean that the code ran exactly the same as it did for Ruby. A lack of errors also doesn't mean that either Ruby's or Avi's runs of the code ran with high accuracy, or that the results can be trusted. Even a small difference in a decimal point may indicate a more fundamental difference in how the analysis was performed, and this could be due to differences in software versions, settings, or any number of items in their computing environments.

## 2.4 Reproducibility is worth the effort!

Perhaps you've found yourself in a situation like Ruby and Avi: struggling to re-run code that you thought for sure was working a minute ago. In the upcoming chapters, we will discuss how to bolster your projects' reproducibility. As you apply these reproducible techniques to your own projects, you may feel like it is taking more time to reach endpoints, but keep in mind that reproducible analyses and projects have higher upfront costs that will absolutely pay off in the long term.
Reproducibility in your analyses is not only a time saver for yourself, but also for your colleagues, your field, and your future self! You might not change a single character in your code but then return to it in a few days/months/years and find that it no longer runs! Reproducible code stands the test of time longer, making 'future you' glad you spent the time to work on it. It's said that your closest collaborator is you from six months ago, but you don't reply to email (Broman 2016). Many a data scientist has referred to their frustration with their past selves:

> Dear past-Hadley: PLEASE COMMENT YOUR CODE BETTER. Love present-Hadley
>
> — Hadley Wickham (@hadleywickham) April 7, 2016

The more you comment your code and make it clear and readable, the more your future self will thank you.

Reproducible code also saves your colleagues time! The more reproducible your code is, the less time all of your collaborators will need to spend troubleshooting it. The more people who use your code and need to try to fix it, the more time is wasted. This can add up to a lot of wasted researcher time and effort. Reproducible code, on the other hand, saves everyone time and effort, and it will also motivate individuals to use and cite your code and analyses in the future!

## 2.5 Reproducibility exists on a continuum!

Incremental work on your analyses is good! You do not need to make your analyses perfect on the first try or even within a particular time frame. The first step in creating an analysis is to get it to work once! But the work does not end there. Furthermore, no analysis is or will ever be perfect in the sense of being reproducible in every single context throughout time. Incrementally pushing our analyses toward the right of this continuum is the goal.
# Chapter 3 Organizing your project

## 3.1 Learning Objectives

Keeping your files organized is a skill that has a high long-term payoff. While you are in the thick of an analysis, you may underestimate how many files and terms you have floating around. But a short time later, you may return to your files and realize your organization was not as clear as you hoped.

Tayo (2019) discusses four particular reasons why it is important to organize your project:

1. Organization increases productivity. If a project is well organized, with everything placed in one directory, it makes it easier to avoid wasting time searching for project files such as datasets, code, output files, and so on.
2. A well-organized project helps you to keep and maintain a record of your ongoing and completed data science projects.
3. Completed data science projects could be used for building future models. If you have to solve a similar problem in the future, you can use the same code with slight modifications.
4. A well-organized project can easily be understood by other data science professionals when shared on platforms such as GitHub.

Organization is yet another aspect of reproducibility that saves you and your colleagues time!

## 3.2 Organizational strategies

There's a lot of ways to keep your files organized, and there's not a "one size fits all" organizational solution (Shapiro et al. 2021). In this chapter, we will discuss some generalities, but as far as specifics we will point you to others who have written about what works for them, and advise that you use their ideas as inspiration to figure out a strategy that works for you and your team.
The most important aspects of your project organization scheme are that it:

- Is project-oriented (Bryan 2017).
- Follows consistent patterns (Shapiro et al. 2021).
- Makes it easy for you and others to find the files you need quickly (Shapiro et al. 2021).
- Minimizes the likelihood for errors (like writing over files accidentally) (Shapiro et al. 2021).
- Is something maintainable (Shapiro et al. 2021)!

### 3.2.1 Tips for organizing your project:

Getting more specific, here's some ideas of how to organize your project:

- Make file names informative to those who don't have knowledge of the project, but avoid using spaces, quotes, or unusual characters in your filenames and folders -- these only serve to make reading in files a nightmare in some programs.
- Number scripts in the order that they are run.
- Keep like-files together in their own directory: results tables with other results tables, etc. Most importantly, keep raw data separate from processed data or other results!
- Put source scripts and functions in their own directory, for things that should never need to be called directly by yourself or anyone else.
- Put output in its own directories, like `results` and `plots`.
- Have a central document (like a README) that describes the basic information about the analysis and how to re-run it.
- Make it easy on yourself: dates aren't necessary. The computer keeps track of those.
- Make a central script that re-runs everything -- including the creation of the folders! (More on this in a later chapter.)

Let's see what these principles might look like put into practice.
#### 3.2.1.1 Example organizational scheme

Here's an example of what this might look like:

```
project-name/
├── run_analysis.sh
├── 00-download-data.sh
├── 01-make-heatmap.Rmd
├── README.md
├── plots/
│   └── project-name-heatmap.png
├── results/
│   └── top_gene_results.tsv
├── raw-data/
│   ├── project-name-raw.tsv
│   └── project-name-metadata.tsv
├── processed-data/
│   └── project-name-quantile-normalized.tsv
└── util/
    ├── plotting-functions.R
    └── data-wrangling-functions.R
```

What these hypothetical files and folders contain:

- `run_analysis.sh` - A central script that runs everything again.
- `00-download-data.sh` - The script that needs to be run first and is called by `run_analysis.sh`.
- `01-make-heatmap.Rmd` - The script that needs to be run second and is also called by `run_analysis.sh`.
- `README.md` - The document with the information that will orient someone to this project; we'll discuss how to create a helpful README in an upcoming chapter.
- `plots` - A folder of plots and resulting images.
- `results` - A folder of results tables.
- `raw-data` - Data files as they first arrive, before anything has been done to them.
- `processed-data` - Data that has been modified from the raw data in some way.
- `util` - A folder of utilities that never need to be called or touched directly unless troubleshooting something.

## 3.3 Readings about organizational strategies for data science projects:

But you don't have to take my organizational strategy; there are lots of ideas out there. You can read through some of these articles to think about what kind of organizational strategy might work for you and your team:

- Jenny Bryan's organizational strategies (Bryan and Hester 2021).
- Danielle Navarro's organizational strategies (Navarro 2021).
- Jenny Bryan on project-oriented workflows (Bryan 2017).
- Data Carpentry mini-course about organizing projects ("Project Organization and Management for Genomics" 2021).
- Andrew Severin's strategy for organization (Severin 2021).
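The central `run_analysis.sh` idea above can be sketched as a short shell script. This is only an illustration, not the course's actual script: the step scripts it would call are the hypothetical files from the example scheme, so those calls are left commented out.

```shell
#!/usr/bin/env bash
# A minimal sketch of a central "re-run everything" script.
# Stop on the first error so a broken step doesn't silently corrupt results.
set -euo pipefail

# Re-create the folders so the analysis can run from scratch
mkdir -p raw-data processed-data results plots

# Step 0: download the raw data (hypothetical script from the scheme above)
# bash 00-download-data.sh

# Step 1: render the heatmap notebook (hypothetical; assumes R + rmarkdown)
# Rscript -e "rmarkdown::render('01-make-heatmap.Rmd')"

echo "Analysis folders are in place."
```

Because the folder creation lives in the script, a collaborator (or future you) can delete everything except the code and raw data and still regenerate the whole project with one command.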
- A BioStars thread where many individuals share their own organizational strategies ("How Do You Manage Your Files & Directories for Your Projects?" 2010).
- Data Carpentry course chapter about getting organized ("Introduction to the Command Line for Genomics" 2019).

## 3.4 Get the exercise project files (or continue with the files you used in the previous chapter)

Get the Python project example files: click this link to download. Now double click your chapter zip file to unzip. For Windows you may have to follow these instructions.

Get the R project example files: click this link to download. Now double click your chapter zip file to unzip. For Windows you may have to follow these instructions.

## 3.5 Exercise: Organize your project!

Using your computer's GUI (drag, drop, and clicking), organize the files that are part of this project:

- Organize these files using an organizational scheme similar to what is described above.
- Create folders like `plots`, `results`, and `data`. Note that `aggregated_metadata.json` and `LICENSE.TXT` also belong in the `data` folder.
- Delete any files that say "OLD". Keeping multiple versions of your scripts around is a recipe for mistakes and confusion. In the advanced course we will discuss how to use version control to help you track this more elegantly.

After your files are organized, you are ready to move on to the next chapter and create a notebook!

Any feedback you have regarding this exercise is greatly appreciated; you can fill out this form!

# Chapter 4 Making your project open source with GitHub

## 4.1 Learning Objectives

git is a version control system that is a great tool for creating reproducible analyses. What is version control?
Ruby here is experiencing a lack of version control and could probably benefit from using git. All of us at one point or another have created different versions of a file or document, but for analysis projects this can easily get out of hand if you don't have a system in place. That's where git comes in handy. There are other version control systems as well, but git is the most popular, in part because it works with GitHub, an online hosting service for git-controlled files.

### 4.1.1 GitHub and git allow you to…

#### 4.1.1.1 Maintain transparent analyses

Open and transparent analyses are a critical part of conducting open science. GitHub allows you to conduct your analyses in an open source manner. Open science also allows others to better understand your methods and potentially borrow them for their own research, saving everyone time!

#### 4.1.1.2 Have backups of your code and analyses at every point

Life happens; sometimes you misplace a file or your computer malfunctions. If you ever lose data on your computer or need to retrieve something from an earlier version of your code, GitHub allows you to revert your losses.

#### 4.1.1.3 Keep a documented history of your project

Over time in a project, a lot happens, especially when it comes to exploring and handling data. Sometimes the rationale behind decisions that were made around an analysis can get lost. GitHub keeps communications and tracks the changes to your files so that you don't have to revisit a question you already answered.

#### 4.1.1.4 Collaborate with others

Analysis projects highly benefit from good collaborations! But having multiple copies of code on multiple collaborators' computers can be a nightmare to keep straight. GitHub allows people to work on the same set of code concurrently but still have a method to integrate all the edits together in a systematic way.
#### 4.1.1.5 Experiment with your analysis

Data science projects often lead to side analyses that could be very worthwhile but might be scary to venture into if you don't have your code well version controlled. Git and GitHub allow you to venture into these side experiments without fear, since your main code can be kept safe from your side venture.

## 4.2 Get the exercise project files (or continue with the files you used in the previous chapter)

Get the Python project example files: click this link to download. Now double click your chapter zip file to unzip. For Windows you may have to follow these instructions.

Get the R project example files: click this link to download. Now double click your chapter zip file to unzip. For Windows you may have to follow these instructions.

## 4.3 Exercise: Set up a project on GitHub

Go here for the video tutorial version of this exercise.

Now that we understand how useful GitHub is for creating reproducible analyses, it's time to set ourselves up on GitHub. Git and GitHub have a whole rich world of tools and terms that can get complex quickly, but for this exercise we will not worry about those terms and functionalities just yet; we will focus on getting code up on GitHub so we are ready to collaborate and conduct open analyses!

1. Go to GitHub's main page and click Sign Up if you don't have an account.
2. Follow these instructions to create a repository. As a general, but not absolute, rule, you will want to keep one GitHub repository for one analysis project.
3. Name the repository something that reminds you what it's related to.
4. Choose Public.
5. Check the box that says Add a README.
6. Follow these instructions to add the example files you downloaded to your new repository.

Congrats! You've started your very own project on GitHub! We encourage you to do the same with your own code and other projects!

Any feedback you have regarding this exercise is greatly appreciated; you can fill out this form!
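For readers who prefer the command line over GitHub's point-and-click interface, the same basic setup can be sketched with a few git commands. The repository name, user name, and email below are made up for illustration; `-C` just tells git to run inside the project folder.

```shell
# Create a project folder and turn it into a git repository
mkdir my-analysis-project
git -C my-analysis-project init

# Add a README, the same file the "Add a README" checkbox would create
echo "# my-analysis-project" > my-analysis-project/README.md

# Record the file in version control
git -C my-analysis-project add README.md
git -C my-analysis-project -c user.name="Ruby" \
    -c user.email="ruby@example.com" commit -m "Add README"
```

Connecting this local repository to a repository on GitHub would then be done with `git remote add` and `git push`; for this exercise, the drag-and-drop workflow in the linked instructions covers the same ground without any commands.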
# Chapter 5 Using Notebooks

## 5.1 Learning Objectives

Notebooks are a handy way to have the code, output, and scientist's thought process all documented in one place that is easy for others to read and follow. The notebook environment is incredibly useful for reproducible data science for a variety of reasons:

#### 5.1.0.1 Reason 1: Notebooks allow for tracking data exploration and encourage the scientist to narrate their thought process

> Each executed code cell is an attempt by the researcher to achieve something and to tease out some insight from the data set. The result is displayed immediately below the code commands, and the researcher can pause and think about the outcome. As code cells can be executed in any order, modified and re-executed as desired, deleted and copied, the notebook is a convenient environment to iteratively explore a complex problem. (Fangohr 2021)

#### 5.1.0.2 Reason 2: Notebooks allow for easy sharing of results

> Notebooks can be converted to html and pdf, and then shared as static read-only documents. This is useful to communicate and share a study with colleagues or managers. By adding sufficient explanation, the main story can be understood by the reader, even if they wouldn't be able to write the code that is embedded in the document. (Fangohr 2021)

#### 5.1.0.3 Reason 3: Notebooks can be re-run as a script or developed interactively

> A common pattern in science is that a computational recipe is iteratively developed in a notebook. Once this has been found and should be applied to further data sets (or other points in some parameter space), the notebook can be executed like a script, for example by submitting these scripts as batch jobs.
(Fangohr 2021)

This can also be handy, especially if you use automation to enhance the reproducibility of your analyses (something we will talk about in the advanced part of this course).

For all of these reasons, we encourage the use of computational notebooks as a means of enhancing reproducibility. (This course itself is also written with the use of notebooks!)

## 5.2 Get the exercise project files (or continue with the files you used in the previous chapter)

Get the Python project example files: click this link to download. Now double click your chapter zip file to unzip. For Windows you may have to follow these instructions.

Get the R project example files: click this link to download. Now double click your chapter zip file to unzip. For Windows you may have to follow these instructions.

## 5.3 Exercise: Convert code into a notebook!

### 5.3.1 Set up your IDE

For this chapter, we will create notebooks from our example files' code. Notebooks work best with the integrated development environment (IDE) they were created to work with. IDEs are sets of tools that help you develop your code. They are part "point and click" and part command line, and include lots of visuals that will help guide you.

**Set up a Python IDE**

Install JupyterLab. We advise using the conda method to install JupyterLab, because we will return to talk more about conda later on. If you don't have conda, you will need to install that first; we advise going with Anaconda instead of Miniconda. To install Anaconda you can download it from here.

1. Download the installer, and follow the installation prompts.
2. Start up Anaconda Navigator.
3. On the home page choose JupyterLab and click Install. This may take a few minutes.
4. Now you should be able to click Launch underneath JupyterLab. This will open up a page in your browser with JupyterLab.
**Getting familiar with JupyterLab's interface**

The JupyterLab interface consists of a main work area containing tabs of documents and activities, a collapsible left sidebar, and a menu bar. The left sidebar contains a file browser, the list of running kernels and terminals, the command palette, the notebook cell tools inspector, and the tabs list.

The menu bar at the top of JupyterLab has top-level menus that expose actions available in JupyterLab along with their keyboard shortcuts. The default menus are:

- File: actions related to files and directories
- Edit: actions related to editing documents and other activities
- View: actions that alter the appearance of JupyterLab
- Run: actions for running code in different activities such as notebooks and code consoles
- Kernel: actions for managing kernels, which are separate processes for running code
- Tabs: a list of the open documents and activities in the dock panel
- Settings: common settings and an advanced settings editor
- Help: a list of JupyterLab and kernel help links

**Set up an R IDE**

Install RStudio (and install R first if you have not already). After you've downloaded the RStudio installation file, double click on it and follow along with the installation prompts. Open up the RStudio application by double clicking on it.

**Getting familiar with RStudio's interface**

The RStudio environment has four main panes, each of which may have a number of tabs that display different information or functionality (their specific location can be changed under Tools -> Global Options -> Pane Layout):

- The Editor pane is where you can write R scripts and other documents. Each tab here is its own document. This is your text editor, which will allow you to save your R code for future use. Note that code written here will not run until you run it.
- The Console pane is where you can interactively run R code.
- There is also a Terminal tab in the Console pane which can be used for running programs outside R on your computer.
- The Environment pane primarily displays the variables (sometimes known as objects) that are defined during a given R session, and what data or values they might hold.
- The Help viewer pane has several tabs, all of which are pretty important:
  - The Files tab shows the structure and contents of files and folders (also known as directories) on your computer.
  - The Plots tab will reveal plots when you make them.
  - The Packages tab shows which installed packages have been loaded into your R session.
  - The Help tab will show the help page when you look up a function.
  - The Viewer pane will reveal compiled R Markdown documents.

From Shapiro et al. (2021).

More reading about RStudio's interface:

- RStudio IDE Cheatsheet (pdf).
- Navigating the RStudio Interface - R for Epidemiology

### 5.3.2 Create a notebook!

Now, in your respective IDE, we'll turn our unreproducible scripts into notebooks. In the next chapter we will begin to dive into the code itself, but for now, we'll get the notebook ready to go.

**Set up a Python notebook**

1. Start a new notebook by going to New > Notebook.
2. Open up this chapter's example code folder and open the `make-heatmap.py` file.
3. Create a new code chunk in your notebook.
4. Copy and paste all of the code from `make-heatmap.py` into the new chunk. We will break up this large chunk of code into smaller, thematic chunks in the next chapter.
5. Save your `Untitled.ipynb` file as something that tells us what it will end up doing, like `make-heatmap.ipynb`.

For more about using Jupyter notebooks, see this by Mike (2021).

**Set up an R notebook**

1. Start a new notebook by going to File > New Files > R Notebook.
2. Open up this chapter's example code folder and open the `make_heatmap.R` file.
3. Practice creating a new chunk in your R notebook by clicking the Code > Insert Chunk button on the toolbar, or by pressing Cmd+Option+I (on Mac) or Ctrl+Alt+I (on Windows). (You can also manually type out the backticks and `{}`.)
4. Delete all the default text in this notebook but keep the header, which is surrounded by `---` and looks like:

   ```
   ---
   title: "R Notebook"
   output: html_notebook
   ---
   ```

   You can feel free to change the title from "R Notebook" to something that better suits the contents of this notebook.
5. Copy and paste all of the code from `make_heatmap.R` into a new chunk. We will break up this large chunk of code into smaller, thematic chunks in the next chapter.
6. Save your `Untitled.Rmd` as something that tells us what it will end up doing, like `make-heatmap.Rmd`.
7. Notice that upon saving your `.Rmd` file, a new `.nb.html` file of the same name is created. Open that file and choose to view it in a browser. If RStudio asks you to choose a browser, then choose a default browser.
8. This shows the nicely rendered version of your analysis and snapshots whatever output existed when the `.Rmd` file was saved.

For more about using R notebooks, see this by Xie, Allaire, and Grolemund (2018).

Now that you've created your notebook, you are ready to start polishing that code!

Any feedback you have regarding this exercise is greatly appreciated; you can fill out this form!

# Chapter 6 Managing package versions

## 6.1 Learning Objectives

As we discussed previously, sometimes two different researchers can run the same code on the same data and get different results!
What Ruby and Avi may not realize is that although they may have used the same code and data, the software packages that they have on each of their computers might be very different. Even if they have the same software packages, they likely don't have the same versions, and versions can influence results! Different computing environments are not only a headache to detangle, they can also influence the reproducibility of your results (Beaulieu-Jones and Greene 2017).

There are multiple ways to deal with variations in computing environments so that your analyses will be reproducible, and we will discuss a few different strategies for tackling this problem in this course and its follow-up course. But for now, we will start with the least intensive to implement: session info.

There are two strategies for dealing with software versions that we will discuss in this chapter. Either of these strategies can be used alone or you can use both. They address different aspects of the computing environment discrepancy problem.

### 6.1.1 Strategy 1: Session info - record a list of your packages

One strategy to combat different software versions is to list the session info. This is the easiest (though not most comprehensive) method for handling differences in software versions: have your code list details about your computing environment. Session info can lead to clues as to why results weren't reproducible.

For example, if both Avi and Ruby ran notebooks and included a session info printout, it might show that they have different R versions and different operating systems. Both might have the rmarkdown package attached, but with different rmarkdown package versions. If Avi and Ruby have discrepancies in their results, the session info printout gives a record which may have clues for any discrepancies. This can give them items to look into for determining why the results didn't reproduce as expected.
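To make the idea concrete, here is a minimal sketch of what "listing your session info" means, using only Python's standard library. (Dedicated tools, like the `session_info` package used in this chapter's exercise or R's `sessionInfo()`, produce much richer reports; the function name here is just illustrative.)

```python
import platform
import sys

def print_session_info():
    """Print the interpreter version, the operating system, and the
    version of every imported top-level module that reports one."""
    print("Python:", sys.version.split()[0])
    print("OS:", platform.platform())
    for name, module in sorted(sys.modules.items()):
        version = getattr(module, "__version__", None)
        if version and "." not in name:  # top-level modules only
            print(f"{name}: {version}")

print_session_info()
```

Running something like this at the end of a notebook records enough context (interpreter, operating system, package versions) for a collaborator to spot obvious environment differences at a glance.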
### 6.1.2 Strategy 2: Package managers - share a usable snapshot of your environment

Package managers can help handle your computing environment for you in a way that lets you share it with others. In general, package managers work by capturing a snapshot of the environment, and when that environment snapshot is shared, they attempt to rebuild it.

For the R and Python versions of the exercises, we will be using different managers, but the foundational strategy will be the same: include a file from which someone else could replicate your package setup. For both exercises, we will download an environment 'snapshot' file we've set up for you, then we will practice adding a new package to the environments we've provided, and add these files to your new repository along with the rest of your example project files.

For Python, we'll use conda for package management and store this information in an `environment.yml` file. For R, we'll use renv for package management and store this information in a `renv.lock` file.

## 6.2 Get the exercise project files (or continue with the files you used in the previous chapter)

Get the Python project example files: click this link to download. Now double click your chapter zip file to unzip. For Windows you may have to follow these instructions.

Get the R project example files: click this link to download. Now double click your chapter zip file to unzip. For Windows you may have to follow these instructions.

## 6.3 Exercise 1: Print out session info

**Python version of the exercise**

In your scientific notebook, you'll need to add two items:

1. Add `import session_info` to a code chunk at the beginning of your notebook.
2. Add `session_info.show()` to a new code chunk at the very end of your notebook.

Save your notebook as is. Note it will not run correctly until we address the issues with the code in the next chapter.
**R version of the exercise**

In your Rmd file, add a chunk at the very end that looks like this:

```r
sessionInfo()
```

Its output will look something like this:

```
## R version 4.0.2 (2020-06-22)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.3 LTS
##
## Matrix products: default
## BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so
##
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=C
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base
##
## loaded via a namespace (and not attached):
##  [1] knitr_1.33      magrittr_2.0.2  hms_0.5.3       R6_2.4.1
##  [5] rlang_0.4.10    highr_0.8       stringr_1.4.0   httr_1.4.2
##  [9] tools_4.0.2     xfun_0.26       jquerylib_0.1.4 htmltools_0.5.0
## [13] ellipsis_0.3.1  ottrpal_0.1.2   yaml_2.2.1      digest_0.6.25
## [17] tibble_3.0.3    lifecycle_1.0.0 crayon_1.3.4    bookdown_0.24
## [21] readr_1.4.0     vctrs_0.3.4     fs_1.5.0        curl_4.3
## [25] evaluate_0.14   rmarkdown_2.10  stringi_1.5.3   compiler_4.0.2
## [29] pillar_1.4.6    pkgconfig_2.0.3
```

Save your notebook as is. Note it will not run correctly until we address the issues with the code in the next chapter.

## 6.4 Exercise 2: Package management

**Python version of the exercise**

1. Download this starter conda `environment.yml` file by clicking on the link and place it in your example project files directory.
2. Navigate to your example project files directory using the command line.
3. Create your conda environment using this file in the command:

   ```
   conda env create --file environment.yml
   ```

4. Activate your conda environment using this command:

   ```
   conda activate reproducible-python
   ```

5. Now start up JupyterLab again using this command:

   ```
   jupyter lab
   ```

6. Follow these instructions to add the `environment.yml` file to the GitHub repository you created in the previous chapter.
Later we will practice and discuss how to more fully utilize the features of GitHub, but for now, just drag and drop it as the instructions linked describe.

### 6.4.1 More resources on how to use conda

- Install Jupyter using your own environment (Mac specific)
- Definitive guide to using conda

**R version of the exercise**

1. First install the renv package. Go to RStudio and the Console pane, and install renv using the following (you should only need to do this once per computer or RStudio environment):

   ```r
   install.packages("renv")
   ```

2. Now set up renv to use in your project. Change to the current directory for your project using `setwd()` in your console window (don't put this in a script or notebook). Then use this command in your project:

   ```r
   renv::init()
   ```

   This will start up renv in your particular project. (What's `::` about? In brief, it allows you to use a function from a package without loading the entire thing with `library()`.)

3. Now you can develop your project as you normally would, installing and removing packages in R as you see fit. For the purposes of this exercise, let's install the styler package using the following command. (The styler package will come in handy for styling our code in the next chapter.)

   ```r
   install.packages("styler")
   ```

4. Now that we have installed styler, we will want to add it to our renv snapshot. To add any packages we've installed to our renv snapshot, we use this command:

   ```r
   renv::snapshot()
   ```

   This will save whatever packages we are currently using to our environment snapshot file, called `renv.lock`. This `renv.lock` file is what we can share with our collaborators so they can replicate our computing environment.

5. If your package installation attempts are unsuccessful and you'd like to revert to the previous state of your environment, you can run `renv::restore()`. This will restore your `renv.lock` file to what it was before you attempted to install styler or whatever other packages you tried to install.

You should see that an `renv.lock` file is now created or updated!
You will always want to include this file with your project files, which means we will want to add it to our GitHub! Follow these instructions to add your `renv.lock` file to the GitHub repository you created in the previous chapter. Later we will practice and discuss how to more fully utilize the features of GitHub, but for now, just drag and drop it as the instructions linked describe.

After you've added your computing environment files to your GitHub, you're ready to continue using them with your IDE to actually work on the code in your notebook!

Any feedback you have regarding this exercise is greatly appreciated; you can fill out this form!

# Chapter 7 Writing durable code

## 7.1 Learning Objectives

## 7.2 General principles

#### 7.2.0.1 Work on your code iteratively

Getting your code to work the first time is the first step, but don't stop there! Just as in writing a manuscript you wouldn't consider your first draft a final draft, polishing code works best in an iterative manner. Although you may need to set it aside for the day to give your brain a rest, return to your code later with fresh eyes and try to look for ways to improve upon it!

#### 7.2.0.2 Prioritize readability over cleverness

Some cleverness in code can be helpful; too much can make it difficult for others (including your future self!) to understand. If cleverness compromises the readability of your code, it probably is not worth it. Clever but unreadable code won't be re-used or trusted by others (AGAIN, including your future self!).

What does readable code look like?
Orosz (2019) has some thoughts on writing readable code:

> Readable code starts with code that you find easy to read. When you finish coding, take a break to clear your mind. Then try to re-read the code, putting yourself in the mindset that you know nothing about the changes and why you made them.
>
> Can you follow along with your code? Do the variables and method names help understand what they do? Are there comments at places where just the code is not enough? Is the style of the code consistent across the changes?
>
> Think about how you could make the code more readable. Perhaps you see some functions that do too many things and are too long. Perhaps you find that renaming a variable would make its purpose clearer. Make changes until you feel like the code is as expressive, concise, and pretty as it can be.
>
> The real test of readable code is others reading it. So get feedback from others, via code reviews. Ask people to share feedback on how clear the code is. Encourage people to ask questions if something does not make sense. Code reviews - especially thorough code reviews - are the best way to get feedback on how good and readable your code is.
>
> Readable code will attract little to no clarifying questions, and reviewers won't misunderstand it. So pay careful attention to the cases when you realize someone misunderstood the intent of what you wrote or asked a clarifying question. Every question or misunderstanding hints to opportunities to make the code more readable.
>
> A good way to get more feedback on the clarity of your code is to ask for feedback from someone who is not an expert on the codebase you are working on. Ask specifically for feedback on how easy to read your code is. Because this developer is not an expert on the codebase, they'll focus on how much they can follow your code. Most of the comments they make will be about your code's readability.

We'll talk a bit more about code review in an upcoming chapter!

More reading:

- Readable Code by Orosz (2019).
- Write clean R code by Dubel (2021).
- Python Clean Code: 6 Best Practices to Make Your Python Functions More Readable by Tran (2021).

#### 7.2.0.3 DRY up your code

DRY is an acronym: "Don't repeat yourself" (Smith 2013).

> "I hate code, and I want as little of it as possible in our product." – Diederich (2012)

If you find yourself writing something more than once, you might want to write a function, or store something as a variable. The added benefit of writing a function is you might be able to borrow it in another project. DRY code is easier to fix and maintain because if it breaks, it's easier to fix something in one place than in 10 places. DRY code is easier on the reviewer because they don't have to review the same thing twice, but also because they don't have to review the same thing twice. ;)

DRYing code takes some iterative passes and edits, but in the end DRY code saves you and your collaborators time and can be something you reuse again in a future project!

Here's a slightly modified example from Bernardo (2021) of what DRY vs non-DRY code might look like:

```r
paste('Hello','John', 'welcome to this course')
paste('Hello','Susan', 'welcome to this course')
paste('Hello','Matt', 'welcome to this course')
paste('Hello','Anne', 'welcome to this course')
paste('Hello','Joe', 'welcome to this course')
paste('Hello','Tyson', 'welcome to this course')
paste('Hello','Julia', 'welcome to this course')
paste('Hello','Cathy', 'welcome to this course')
```

This could be functional-ized and rewritten as:

```r
GreetStudent <- function(name) {
  greeting <- paste('Hello', name, 'welcome to this course')
  return(greeting)
}

class_names <- c('John', 'Susan', 'Matt', 'Anne', 'Joe', 'Tyson', 'Julia', 'Cathy')

lapply(class_names, GreetStudent)
```

Now, if you wanted to edit the greeting, you'd only need to edit it in the function, instead of in each instance.

More reading about this idea:

- DRY Programming Practices by Klinefelter (2016).
- Keeping R Code DRY with functions by Riffomonas Project (2021).
- Write efficient R code for science by Max Joseph (2017).
- Write efficient Python code by Leah Wasser (2019).
- Don't repeat yourself: Python functions by Héroux (2018).

#### 7.2.0.4 Don't be afraid to delete and refresh a lot

Don't be afraid to delete it all and re-run (multiple times). This includes refreshing your kernel/session in your IDE. In essence, this is the data science version of "Have you tried turning it off and then on again?" Some bugs in your code exist or go unnoticed because old objects and libraries have overstayed their welcome in your environment.

Why do you need to refresh your kernel/session? As a quick example, let's suppose you are troubleshooting something that centers around an object named `some_obj`, but then you rename this object to `iris_df`. When you rename this object you may need to update other places in the code. If you don't refresh your environment while working on your code, `some_obj` will still be in your environment. This will make it more difficult for you to find where else the code needs to be updated. Refreshing your kernel/session goes beyond objects defined in your environment; it also affects packages and dependencies loaded, and all kinds of other things attached to your kernel/session.

As a quick experiment, try this in your Python or R environment. The `dir()` and `ls()` functions list your defined variables in your Python and R environments, respectively.

In Python:

```python
some_obj=[]
dir()
```

Now refresh your Kernel and re-run `dir()`:

```python
dir()
```

You should see you no longer have `some_obj` listed as being defined in your environment.

In R:

```r
some_obj <- c()
ls()
```

Now refresh your session and re-run `ls()`:

```r
ls()
```

You should see you no longer have `some_obj` listed as being defined in your environment.

Keeping around old code and objects is generally more of a hindrance than a time saver.
Sometimes it can be easy to get very attached to a chunk of code that took you a long time to troubleshoot, but there are three reasons you don't need to stress about deleting it:

1) You might write better code on the second try (or third or n'th).
2) Keeping around old code makes it harder for you to write and troubleshoot new better code – it's easier to confuse yourself. Sometimes a fresh start can be what you need.
3) With version control you can always return to that old code! (We'll dive more into version control later on, but you've started the process by [uploading your code to GitHub in chapter 4](https://jhudatascience.org/Reproducibility_in_Cancer_Informatics/making-your-project-open-source-with-github.html)!)

This means you should not comment out old code. Just delete it! No code is so precious that you need to keep it commented out (particularly if you are using version control and you can retrieve it in other ways should you need it).

Related to this, if you want to be certain that your code is reproducible, it's worth deleting all your output and re-running everything with a fresh session. The first step to knowing if your analysis is reproducible is seeing if you can repeat it yourself!

#### 7.2.0.5 Use code comments effectively

Good code comments are a part of writing good, readable code! Your code is more likely to stand the test of time if others, including yourself in the future, can see what's happening well enough to trust it themselves. This will encourage others to use your code and help you maintain it!

'Current You' who is writing your code may know what is happening, but 'Future You' will have no idea what 'Current You' was thinking (Spielman, n.d.):

> 'Future You' comes into existence about one second after you write code, and has no idea what on earth Past You was thinking. Help out 'Future You' by adding lots of comments! 'Future You' next week thinks Today You is an idiot, and the only way you can convince 'Future You' that Today You is reasonably competent is by adding comments in your code explaining why Today You is actually not so bad.
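As a made-up illustration of the kind of comment that helps 'Future You': a comment that restates the code adds little, while one that records the reasoning earns its keep.

```python
import math

raw_counts = [0, 10, 100]  # made-up read counts; one gene has zero reads

# Unhelpful comment (restates *what* the code does): add 1 to each count
# Helpful comment (records *why*):
# Add a pseudocount of 1 so zero-count genes don't break the log2 transform
log_counts = [math.log2(count + 1) for count in raw_counts]
print(log_counts)
```

The "why" comment is the one that will save you (or a collaborator) from "fixing" the `+ 1` away later.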
Your code and your understanding of it will fade soon after you write it, leaving your hard work to depreciate. Code that works is a start, but readable AND working code is best! Comments can help clarify at points where your code might need further explanation.

The best code comments explain the why of what you are doing. The act of writing them can also help you think out your thought process and perhaps identify a better solution to the odd parts of your code. (From Savonen (2021a))

More reading:

- Creating clarifying code comments
- Best Practices for Writing Code Comments by Spertus (2021).
- What Makes a Good Code Comment? by Cronin (2019).
- The Value of Code Documentation by Meza (2018).
- Some internet wisdom on R documentation by Frazee (2014).
- How to Comment Your Code Like a Pro: Best Practices and Good Habits by Keeton (2019).

#### 7.2.0.6 Use informative variable names

Try to avoid using variable names that have no meaning, like `tmp`, `x`, or `i`. Meaningful variable names make your code more readable! Additionally, variable names that are longer than one letter are much easier to search and replace if needed. One-letter variables are hard to replace and hard to read. Don't be afraid of long variable names; they are very unlikely to be confused!

> 1) Write intention-revealing names.
> 2) Use consistent notation for naming convention.
> 3) Use standard terms.
> 4) Do not number a variable name.
> 5) When you find another way to name a variable, refactor as fast as possible.
>
> (Hobert 2018)

More reading:

- R for Epidemiology - Coding best Practices by Cannell (2021).
- Data Scientists: Your Variable Names Are Awful. Here's How to Fix Them by Koehrsen (2019).
- Writing Variable — Informative, Descriptive & Elegant by Hobert (2018).

#### 7.2.0.7 Follow a code style

Just like writing that doesN"t FoLLOW conv3nTi0Ns OR_sPAcinng 0r sp3llinG can be distracting, the same goes for code.
Your code may even work all the same, just like you understood what I wrote in that last sentence, but a lack of consistent style can require more brain power from your readers to understand it. For reproducibility purposes, readability is important! The easier you can make it on your readers, the more likely they will be able to understand and reproduce the results.

There are different style guides out there that people adhere to. It doesn't matter so much which one you choose as that you pick one and stick to it for a particular project.

Python style guides:

- PEP8 style guide "PEP 8 – Style Guide for Python Code" (2021).
- Google Python style guide "Styleguide" (2021).

R style guides:

- Hadley Wickham's Style guide Wickham (2019).
- Google R style guide "Google's R Style Guide" (2021).

Although writing code following a style as you go is a good practice, we're all human and that can be tricky to do, so we recommend using an automatic styler to fix up your code for you. For Python code, you can use python black, and for R, styler.

#### 7.2.0.8 Organize the structure of your code

Readable code should follow an organized structure. Just like how outlines help the structure of manuscript writing, outlines can also help the organization of code writing. A tentative outline for a notebook might look like this:

1) A description of the purpose of the code (in Markdown).
2) Import the libraries you will need (including sourcing any custom functions).
3) List any hard-coded variables.
4) Import data.
5) Do any data cleaning needed.
6) The main thing you need to do.
7) Print out session info.

Note that if your notebook gets too long, you may want to separate out things into their own scripts. Additionally, it's good practice to keep custom functions in their own file and import them. This allows you to use them elsewhere and also keeps the main part of the analysis cleaner.
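That outline could be sketched as a script skeleton. All names and paths below are placeholders for illustration, not part of the course's example project:

```python
"""Purpose: build an annotated heatmap from a gene expression matrix (description goes here)."""

# Import the libraries you will need (including sourcing any custom functions)
import sys
from pathlib import Path

# List any hard-coded variables up front, with the rationale in a comment
PROJECT_ID = "SRP070849"  # accession used to build all file paths below
DATA_DIR = Path("data") / PROJECT_ID

# Import data (commented out here so the sketch runs without the data files)
# expression_df = pd.read_csv(DATA_DIR / f"{PROJECT_ID}.tsv", sep="\t")

# Do any data cleaning needed
# ...

# The main thing you need to do (e.g. build and save the heatmap)
# ...

# Print out session info
print(sys.version)
```

Having the same skeleton in every notebook means readers always know where to look for the hard-coded variables or the data import.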
#### 7.2.0.9 Set the seed if your analysis has randomness involved

If any randomness is involved in your analysis, you will want to set the seed in order for your results to be reproducible. In brief, computers don't actually create numbers randomly; they create numbers pseudorandomly. If you want your results to be reproducible, you should give your computer a seed by which to create random numbers. This will allow anyone who re-runs your analysis to have a positive control and eliminate randomness as a reason the results were not reproducible.

For more on how setting the seed works, we'll run a quick experiment. First let's set a seed (it doesn't matter what number we use, just that we pick a number), so let's use 1234, and then create a "random" number.

```r
# Set the seed:
set.seed(1234)

# Now create a random number
runif(1)
## [1] 0.1137034
```

Now if we try a different seed, we will get a different "random" number.

```r
# Set a different seed:
set.seed(4321)

# Now create a random number again
runif(1)
## [1] 0.334778
```

But if we return to the original seed we used, 1234, we will get the original "random" number we got.

```r
# Set this back to the original seed
set.seed(1234)

# Now we'll get the same "random" number we got when we set the seed to 1234 previously
runif(1)
## [1] 0.1137034
```

More reading:

- Set seed by Soage (2020).
- Generating random numbers by Chang (2021).

#### 7.2.0.10 To review general principles:

## 7.3 More reading on best coding practices

There are so many opinions and strategies on best practices for code. And although a lot of these principles are generally applicable, not all of them are one size fits all. Some code practices are context-specific, so sometimes you may need to pick and choose what works for you, your team, and your particular project.

#### Python specific:

- Reproducible Programming for Biologists Who Code Part 2: Should Dos by Heil (2020).
- 15 common coding mistakes data scientist make in Python (and how to fix them) by Csendes (2020).
- Data Science in Production — Advanced Python Best Practices by Kostyuk (2020).
- 6 Mistakes Every Python Beginner Should Avoid While Coding by Saxena (2021).

#### R specific:

- [Data Carpentry's: Best Practices for Writing R Code](https://swcarpentry.github.io/r-novice-inflammation/06-best-practices-R.html) by "Best Practices for Writing R Code – Programming with R" (2021).
- [R Programming for Research: Reproducible Research](https://geanders.github.io/RProgrammingForResearch/reproducible-research-1.html) by Good (2021).
- [R for Epidemiology: Coding best practices](https://www.r4epi.com/coding-best-practices.html) by Cannell (2021).
- Best practices for R Programming by Bernardo (2021).

## 7.4 Get the exercise project files (or continue with the files you used in the previous chapter)

Get the Python project example files: click this link to download. Now double click your chapter zip file to unzip. For Windows you may have to follow these instructions.

Get the R project example files: click this link to download. Now double click your chapter zip file to unzip. For Windows you may have to follow these instructions.

## 7.5 Exercise 1: Make code more durable!

### 7.5.1 Organize the big picture of the code

Before diving in line by line, it can be helpful to make a code outline of sorts. What are the main steps you need to accomplish in this notebook? What are the starting and ending points for this particular notebook? For example, for this make-heatmap notebook we want to:

1) Set up analysis folders and declare file names.
2) Install the libraries we need.
3) Import the gene expression data and metadata.
4) Filter down the gene expression data to genes of interest – in this instance the most variant ones.
5) Clean the metadata.
6) Create an annotated heatmap.
7) Save the heatmap to a PNG.
8) Print out the session info!

**Python version of the exercise**

The exercise: polishing code.

1) Start up JupyterLab by running `jupyter lab` from your command line.
2) Activate your conda environment using `conda activate reproducible-python`.
3) Open up the notebook you made in the previous chapter, `make-heatmap.ipynb`.
4) Work on organizing the code chunks and adding documentation to reflect the steps we've laid out in the previous section. You may want to work on this iteratively as we dive into the code.

As you clean up the code, you should run and re-run chunks to see if they work as you expect. You will also want to refresh your environment to help you develop the code (sometimes older objects stuck in your environment can inhibit your ability to troubleshoot). In Jupyter, you refresh your environment by using the refresh icon in the toolbar or by going to Restart Kernel.

**Set the seed**

Rationale: The clustering in the analysis involves some randomness. We need to set the seed!

Before: Nothing! We didn't set the seed before!

After: You can pick any number; it doesn't have to be 1234.

```python
random.seed(1234)
```

**Use a relative file path**

Rationale: Absolute file paths only work for the original writer of the code and no one else. But if we make the file path relative to the project setup, then it will work for whomever has the project repository (Mustafeez 2021). Additionally, we can set up our file path names using f-strings so that we only need to change the project ID and the rest will be ready for a new dataset (Python 2021)! Although this requires more lines of code, this setup is much more flexible and ready for others to use.
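If f-strings are new to you, here is a quick standalone illustration (the file name is made up): the expression inside `{ }` is filled in from the variable, so changing the project ID updates every path built from it.

```python
# An f-string interpolates the variable's value into the string
project_id = "SRP070849"
data_file = f"data/{project_id}/{project_id}.tsv"
print(data_file)  # data/SRP070849/SRP070849.tsv
```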
Before:

```python
df1=pd.read_csv('~/a/file/path/only/I/have/SRP070849.tsv', sep='\t')
mdf=pd.read_csv('~/a/file/path/only/I/have/SRP070849_metadata.tsv', sep='\t')
```

After:

```python
# Declare project ID
id = "SRP070849"

# Define the file path to the data directory
data_dir = Path(f"data/{id}")

# Declare the file path to the gene expression matrix file
data_file = data_dir.joinpath(f"{id}.tsv")

# Declare the file path to the metadata file
# inside the directory saved as `data_dir`
metadata_file = data_dir.joinpath(f"metadata_{id}.tsv")

# Read in metadata TSV file
metadata = pd.read_csv(metadata_file, sep="\t")

# Read in data TSV file
expression_df = pd.read_csv(data_file, sep="\t")
```

Related readings:

- f-strings in Python by Geeks (2018).
- f-Strings: A New and Improved Way to Format Strings in Python by Python (2021).
- Relative vs absolute file paths by Mustafeez (2021).
- About join path by "Python Examples of Pathlib.Path.joinpath" (2021).

**Avoid using mystery numbers**

Rationale: Avoid using numbers that don't have context around them in the code. Include the calculations for the number, or if it needs to be hard-coded, explain the rationale for that number in the comments. Additionally, using variable and column names that tell you what is happening helps clarify what the number represents.

Before:

```python
df1['calc'] =df1.var(axis = 1, skipna = True)
df2=df1[df1.calc >float(10)]
```

After:

```python
# Calculate the variance for each gene
expression_df["variance"] = expression_df.var(axis=1, skipna=True)

# Find the upper quartile for these data
upper_quartile = expression_df["variance"].quantile([0.90]).values

# Filter the data choosing only genes whose variances are in the upper quartile
df_by_var = expression_df[expression_df.variance > float(upper_quartile)]
```

Related readings:

- Stop Using Magic Numbers and Variables in Your Code by Aaberge (2021).

**Add checks**

Rationale: Just because your script ran without an error that stopped the script doesn't mean it is accurate and error free.
Silent errors are the trickiest to solve, because you often won't know that they happened! A very common error is data that is in the wrong order. In this example we have two data frames that contain information about the same samples. But in the original script, we don't ever check that the samples are in the same order in the metadata and the gene expression matrix! This is a really easy way to get incorrect results!

Before: Nothing, we didn't check for this before.

After:

```python
print(metadata["refinebio_accession_code"].tolist() == expression_df.columns.tolist())
```

Continue to try to apply the general advice we gave about code to your notebook! Then, when you are ready, take a look at what our "final" version looks like in the example Python repository. (Final here is in quotes because we may continue to make improvements to this notebook too – remember what we said about iterative?)

**R version of the exercise**

About the tidyverse: before we dive into the exercise, a word about the tidyverse. The tidyverse is a highly useful set of packages for creating readable and reproducible data science workflows in R. In general, we will opt for tidyverse approaches in this course, and strongly encourage you to familiarize yourself with the tidyverse if you have not. We will point out some instances where tidyverse functions can help you DRY up your code as well as make it more readable!

More reading on the tidyverse:

- Tidyverse Skills for Data Science by Carrie Wright (n.d.).
- A Beginner's Guide to Tidyverse by A. V. Team (2019).
- Introduction to tidyverse by Shapiro et al. (2021).

The exercise: polishing code.

1) Open up RStudio.
2) Open up the notebook you created in the previous chapter.

Now we'll work on applying the principles from this chapter to the code. We'll cover some of the points here, but then we encourage you to dig into the fully transformed notebook we will link at the end of this section.
3) Work on organizing the code chunks and adding documentation to reflect the steps we've laid out in the previous section. You may want to work on this iteratively as we dive into the code.

As you clean up the code, you should run and re-run chunks to see if they work as you expect. You will also want to refresh your environment to help you develop the code (sometimes older objects stuck in your environment can inhibit your ability to troubleshoot). In RStudio, you refresh your environment by going to the Run menu and using Restart R and Clear Output.

**Set the seed**

Rationale: The clustering in the analysis involves some randomness. We need to set the seed!

Before: Nothing! We didn't set the seed before!

After: You can pick any number; it doesn't have to be 1234.

```r
set.seed(1234)
```

**Get rid of setwd**

Rationale: `setwd()` almost never works for anyone besides the one person who wrote it. And in a few days/weeks it may not work for them either.

Before:

```r
setwd("Super specific/filepath/that/noone/else/has/")
```

After: Nothing! Now that we are working from a notebook, we know that the default current directory is wherever the notebook is placed (Xie, Dervieux, and Riederer 2020).

Related readings:

- Jenny Bryan will light your computer on fire if you use setwd() in a script (Bryan 2017).

**Give the variables more informative names**

Rationale: `xx` doesn't tell us what is in the data here. Also, by using `readr::read_tsv()` from the tidyverse, we'll get a cleaner, faster read and won't have to specify the `sep` argument. Note we are also fixing some spacing and using `<-` so that we can stick to readability conventions.

Before:

```r
xx=read.csv("metadata_SRP070849.tsv", sep = "\t")
```

After:

```r
metadata <- readr::read_tsv("metadata_SRP070849.tsv")
```

Related readings:

- readr::read_tsv() documentation by "Read a Delimited File (Including CSV and TSV) into a Tibble — Read_delim" (n.d.).

**DRYing up data frame manipulations**

Rationale: This chunk of code can be very tricky to understand.
What is happening with `df1` and `df2`? What's being filtered out? Code comments would certainly help understanding, but even better, we can DRY this code up and make the code clearer on its own.

Before: It may be difficult to tell from looking at the before code because there are no comments and it's a bit tricky to read, but the goal of this is to:

1) Calculate variances for each row (each row is a gene).
2) Filter the original gene expression matrix to only genes that have a bigger variance (here we arbitrarily use 10 as a filter cutoff).

```r
df=read.csv("SRP070849.tsv", sep="\t")
sums=matrix(nrow = nrow(df), ncol = ncol(df) - 1)
for(i in 1:nrow(sums)) {
  sums[i, ] <- sum(df[i, -1])
}
df2=df[which(df[, -1] >= 10), ]
variances=matrix(nrow = nrow(dds), ncol = ncol(dds) - 1)
for(i in 1:nrow(dds)) {
  variances[i, ] <- var(dds[i, -1])
}
```

After: Let's see how we can do this in a DRYer and clearer way. We can:

1) Add comments to describe our goals.
2) Use variable names that are more informative.
3) Use the apply functions to do the loop for us – this will eliminate the need for the unclear variable `i` as well.
4) Use the tidyverse to do the filtering for us so we don't have to rename data frames or store extra versions of `df`.

Here's what the above might look like after some refactoring. Hopefully you find this easier to follow, and in total there are fewer lines of code (but with comments too!).
```r
# Read in data TSV file
expression_df <- readr::read_tsv(data_file) %>%
  # Here we are going to store the gene IDs as row names so that
  # we can have only numeric values to perform calculations on later
  tibble::column_to_rownames("Gene")

# Calculate the variance for each gene
variances <- apply(expression_df, 1, var)

# Determine the upper quartile variance cutoff value
upper_var <- quantile(variances, 0.75)

# Filter the data choosing only genes whose variances are in the upper quartile
df_by_var <- data.frame(expression_df) %>%
  dplyr::filter(variances > upper_var)
```

**Add checks**

Rationale: Just because your script ran without an error that stopped the script doesn't mean it is accurate and error free. Silent errors are the trickiest to solve, because you often won't know that they happened! A very common error is data that is in the wrong order. In this example we have two data frames that contain information about the same samples. But in the original script, we don't ever check that the samples are in the same order in the metadata and the gene expression matrix! This is a really easy way to get incorrect results!

Before:

```r
# Nothing... we didn't check for this :(
```

After:

```r
# Make the data in the order of the metadata
expression_df <- expression_df %>%
  dplyr::select(metadata$refinebio_accession_code)

# Check if this is in the same order
all.equal(colnames(expression_df), metadata$refinebio_accession_code)
```

Continue to try to apply the general advice we gave about code to your notebook! Then, when you are ready, take a look at what our "final" version looks like in the example R repository. (Final here is in quotes because we may continue to make improvements to this notebook too – remember what we said about iterative?)

Now that we've made some nice updates to the code, we are ready to do a bit more polishing by adding more documentation! But before we head to the next chapter, we can style the code we wrote automatically by using automatic code stylers!
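As a preview of what an automatic styler does, here is a made-up Python snippet: the styler only makes mechanical formatting changes, so the code behaves identically before and after.

```python
# Before styling: legal Python, but cramped and inconsistent spacing
x=[1,2 ,3]
untidy = {'a':1,'b' :2}

# After a pass through an auto-styler such as black (shown here by hand,
# for illustration): same values, consistent spacing and quoting
y = [1, 2, 3]
tidy = {"a": 1, "b": 2}

print(x == y and untidy == tidy)  # True
```

Because the changes are purely cosmetic, it is safe to run a styler over a whole notebook at once.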
## 7.6 Exercise 2: Style code automatically!

**Styling Python code automatically**

Run your notebook through black. First you'll need to install it by running this command in a Terminal window in your JupyterLab. Make sure you are running this within your conda environment:

```shell
conda activate reproducible-python
```

Now install black:

```shell
pip install black[jupyter]
```

To record your conda environment, run this command:

```shell
conda env export > environment-record.yml
```

Now you can automatically style your code by running this command from your Terminal (be sure to replace `make-heatmap.ipynb` with whatever you have named your notebook):

```shell
python -m black make-heatmap.ipynb
```

You should get a message that your notebook was styled!

**Styling R code automatically**

Let's run your notebook through styler. First you'll need to install it and add it to your renv:

```r
install.packages("styler")
```

Then add it to your renv by running:

```r
renv::snapshot()
```

Now you can automatically style your code by running this command from your Console (be sure to replace `make-heatmap.Rmd` with whatever you have named your notebook):

```r
styler::style_file("make-heatmap.Rmd")
```

You should get a message that your notebook was styled!

Before you are done with this exercise, there's one more thing we need to do: upload the latest version to GitHub. Follow these instructions to add the latest version of your notebook to your GitHub repository. Later, we will practice and discuss how to more fully utilize the features of GitHub, but for now, just drag and drop it as the linked instructions describe.

Any feedback you have regarding this exercise is greatly appreciated; you can fill out this form!

## References
# Chapter 8 Documenting analyses

## 8.1 Learning Objectives

## 8.2 Why documentation?

Documentation is an important but sometimes overlooked part of creating a reproducible analysis! There are two parts of documentation we will discuss here: 1) in-notebook descriptions and 2) READMEs.

Both notebook descriptions and READMEs are written in markdown – a shorthand for HTML (the same as the documentation parts of your code). If you aren't familiar, markdown is such a handy tool and we encourage you to learn it (it doesn't take too long); here's a quick guide to get you started.

### 8.2.1 Notebook descriptions

As we discussed in chapter 5, data analyses can lead one on a winding trail of decisions, but notebooks allow you to narrate your thought process as you travel along these analysis explorations! Your scientific notebook should include descriptions that describe:

#### 8.2.1.1 The purposes of the notebook

- What scientific question are you trying to answer?
- Describe the dataset you are using to try to answer this, and why does it help answer this question?

#### 8.2.1.2 The rationales behind your decisions

- Describe why a particular code chunk is doing a particular thing – the more odd the code looks, the greater the need for you to describe why you are doing it.
- Describe any particular filters or cutoffs you are using, and how did you decide on those?
- For data wrangling steps, why are you wrangling the data in such a way – is this because a certain package you are using requires it?

#### 8.2.1.3 Your observations of the results

- What do you think about the results?
- The plots and tables you show in the notebook – how do they inform your original questions?

### 8.2.2 READMEs!

READMEs are also a great way to help your collaborators get quickly acquainted with the project.
READMEs stick out in a project and are a generally universal signal for people new to the project to start by READing them. GitHub will automatically preview your file called `README.md` when someone comes to the main page of your repository, which further encourages people looking at your project to read the information in your README.

Information that should be included in a README:

- General purpose of the project
- Instructions on how to re-run the project
- Lists of any software required by the project
- Input and output file descriptions
- Descriptions of any additional tools included in the project

You can take a look at this template README to get you started.

#### 8.2.2.1 More about writing READMEs:

- How to write a good README file
- A Beginners Guide to writing a Kicka** README
- How to write an awesome README

## 8.3 Get the exercise project files (or continue with the files you used in the previous chapter)

Get the Python project example files: click this link to download. Now double click your chapter zip file to unzip. For Windows you may have to follow these instructions.

Get the R project example files: click this link to download. Now double click your chapter zip file to unzip. For Windows you may have to follow these instructions.

## 8.4 Exercise 1: Practice beefing up your notebook descriptions

**Python project exercise**

1) Start up JupyterLab by running `jupyter lab` from your command line.
2) Activate your conda environment using `conda activate reproducible-python`.
3) Open up the notebook you've been working on in the previous chapters: `make_heatmap.ipynb`.
4) Create a new chunk in your notebook and choose the "Markdown" option in the dropdown menu.
5) Continue to add more descriptions where you feel it is necessary. You can reference the descriptions we have in the "final" version in the example Python repository.
(Again, final here is in quotes because we may continue to make improvements to this notebook too – remember what we said about iterative?)

**R project exercise**

1) Open up RStudio.
2) Open up the notebook you've been working on in the previous chapters: `make_heatmap.Rmd`.
3) In between code chunks, add more descriptions using Markdown language. You can test how this renders by saving your `.Rmd` and then opening up the resulting `nb.html` file and choosing View in Browser.
4) Continue to add more descriptions where you feel it is necessary. You can reference the descriptions we have in the "final" version in the example R repository. (Again, final here is in quotes because we may continue to make improvements to this notebook too – remember what we said about iterative?)

## 8.5 Exercise 2: Write a README for your project!

1) Download this template README.
2) Fill in the questions inside the { } to create a README for this project. You can reference the "final" versions of the README, but keep in mind it will reference items that we will discuss in the "advanced" portion of this course. See the R README here and the Python README here.
3) Add your README and updated notebook to your GitHub repository. Follow these instructions to add the latest version of your notebook to your GitHub repository. Later, we will practice and discuss how to more fully utilize the features of GitHub, but for now, just drag and drop it as the linked instructions describe.

Any feedback you have regarding this exercise is greatly appreciated; you can fill out this form!

# Chapter 9 Code review

## 9.1 Learning Objectives

We've previously discussed that the only way to know if your analysis is truly reproducible is to send it to someone else to reproduce! That sentiment is at the heart of code review.
Although most of us wouldn’t dare send out a manuscript for publication without our collaborators giving it a line-by-line review, people don’t always feel the same way about code. Parker (2017) describes code review:

> Code review will not guarantee an accurate analysis, but it’s one of the most reliable ways of establishing one that is more accurate than before.

Not only does code review help boost the accuracy and reproducibility of the analysis, it also helps everyone involved in the process learn something new!

#### 9.1.0.1 Recommended reading about code review

- Code Review Guidelines for Humans by Hauer (2018).
- Your Code Sucks! – Code Review Best Practices by Hildebr (2020).
- Best practices for Code Review by S. Team (2021).
- Why code reviews matter (and actually save time!) by Radigan (2021).

## 9.2 Exercise: Set up your code review request!

Since reproducibility is all about someone else being able to run your code and obtain your results, the exercise in this course involves preparing to do just that!

The goal: In the second part of this reproducibility course we will discuss how to conduct formal line-by-line code reviews, but for now, we will discuss how to prep your analysis for someone else to look at your code and attempt to run it.

At this point, you should have a GitHub repository that contains the following:

- A `make_heatmap` notebook
- A README
- A `data` folder containing the metadata and gene expression matrix files in a folder named `SRP070849`:
  - `SRP070849/metadata_SRP070849.tsv`
  - `SRP070849/SRP070849.tsv`

**1) Refresh and delete output**

Before you send off your code to someone else, delete your output (the `results` and `plots` folders) and attempt to re-run it yourself. This also involves restarting your R session/Python kernel and running all the chunks again.

**2) Re-run the whole analysis**

**3) Interrogate and troubleshoot**

If your code has any issues running, try your best to troubleshoot the problems. Read this handy guide for tips on troubleshooting R.
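These first steps boil down to checking that a completely fresh run regenerates the same results. One way to make “the same results” concrete is to record checksums of your output files before deleting them, then compare after the re-run. Below is a minimal Python sketch of that idea; the `results/*.tsv` paths in the usage comment are hypothetical, and this is just one way to spot-check reproducibility, not part of the course's example project:

```python
import hashlib
from pathlib import Path

def checksum(path):
    """SHA-256 digest of a file's bytes; byte-identical outputs hash identically."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def snapshot(paths):
    """Record a checksum per results file, e.g. before deleting output and re-running."""
    return {str(p): checksum(p) for p in paths}

def same_results(before, after):
    """True only if both runs produced the same files with the same contents."""
    return before.keys() == after.keys() and all(
        before[name] == after[name] for name in before
    )

# Hypothetical usage:
# before = snapshot(Path("results").glob("*.tsv"))
# ...delete output, restart your kernel/session, re-run the analysis...
# after = snapshot(Path("results").glob("*.tsv"))
# same_results(before, after)  # should be True for a reproducible analysis
```

Note this only detects byte-level differences; results that are equivalent but not byte-identical (e.g. plots with embedded timestamps) need a more forgiving comparison.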
**4) Rinse and repeat**

Repeat this as many times as needed until you are reliably able to re-run this code and get the same results without any code smells popping up. Dig into bad code smells or bad results smells wherever you sense them. If you aren’t sure why you feel this way about your code or results, hold on to that feeling – your collaborator may be able to see something you don’t.

**5) Let it simmer**

Leave your analysis for a bit. Do you think it’s perfect? Are you at your wits’ end with it? No matter how you feel about it, let it sit for half a day or so and then return to it with fresh eyes (Savonen 2021b).

**6) Re-review your documentation and code with fresh eyes**

Now, with fresh eyes and doing your best to imagine you don’t have the knowledge you have – do your analysis and results make sense?

**7) Are you sure it’s ready?**

Ask yourself whether you’ve polished this code and documentation as far as you can reasonably take it. Realizing that determining what qualifies as “as far as you can reasonably take it” is also a skill you will build with time. Code review is the most efficient use of everyone’s time when your code and documentation have reached this point.

**8) Draft your request**

Now you are ready to send this code to your collaborator, but first try to send them a specific set of instructions and questions about what you would like them to review. Include this information in your message to them (you may want to draft this out in a scratch file).

Code review requests should include:

- A link to your repository that has your README, to get them quickly oriented to the project.
- A request for what kind of feedback you are looking for. Big picture? Technical? Method selection?
- Are there specific areas of the code you are having trouble with or are unsure about? Send a link to the specific lines in GitHub you are asking about.
- Are there results that are surprising, confusing, or smell wrong?
- Be sure to detail what you have dug into and tried at this point for any problematic points.
- Explicitly tell them what commands or tests you’d like them to run.
- Lastly, thank them for helping review your code!

**9) Ready for review**

Now you are ready to send your crafted message to your collaborator for review. For the purposes of this exercise, you may not want to ask your collaborator to spend their time carefully reviewing this practice repository, but now that you understand and have done the steps involved, you are prepared to do this for your own analyses!

TL;DR for asking for a code review:

Any feedback you have regarding this exercise is greatly appreciated; you can fill out this form!

In the second part of this course, we will discuss how to conduct code review through GitHub, further utilize version control, and more!

# About the Authors

These credits are based on our course contributors table guidelines.

| Credits | Names |
|---|---|
| **Pedagogy** | |
| Lead Content Instructor(s) | Candace Savonen |
| Lecturer(s) | Candace Savonen |
| Content Directors | Jeff Leek, Sarah Wheelan |
| Content Consultants | David Swiderski |
| Acknowledgments | Patrick O’Connell |
| **Production** | |
| Content Publisher | Ira Gooding |
| Content Publishing Reviewers | Ira Gooding |
| **Technical** | |
| Course Publishing Engineer | Candace Savonen |
| Template Publishing Engineers | Candace Savonen, Carrie Wright |
| Publishing Maintenance Engineer | Candace Savonen |
| Technical Publishing Stylists | Carrie Wright, Candace Savonen |
| Package Developers (ottrpal) | John Muschelli, Candace Savonen, Carrie Wright |
| **Art and Design** | |
| Illustrator | Candace Savonen |
| Figure Artist | Candace Savonen |
| Videographer | Candace Savonen |
| Videography Editor | Candace Savonen |
| **Funding** | |
| Funder | National Cancer Institute (NCI) UE5 CA254170 |
| Funding Staff | Emily Voeglein, Fallon Bachman |

    ## ─ Session info ───────────────────────────────────────────────────────────────
    ##  setting  value
    ##  version  R version 4.0.2 (2020-06-22)
    ##  os       Ubuntu 20.04.3 LTS
    ##  system   x86_64, linux-gnu
    ##  ui       X11
    ##  language (EN)
    ##  collate  en_US.UTF-8
    ##  ctype    en_US.UTF-8
    ##  tz       Etc/UTC
    ##  date     2024-03-25
    ##
    ## ─ Packages ───────────────────────────────────────────────────────────────────
    ##  package     * version    date       lib source
    ##  assertthat    0.2.1      2019-03-21 [1] RSPM (R 4.0.3)
    ##  bookdown      0.24       2022-02-15 [1] Github (rstudio/bookdown@88bc4ea)
    ##  callr         3.4.4      2020-09-07 [1] RSPM (R 4.0.2)
    ##  cli           2.0.2      2020-02-28 [1] RSPM (R 4.0.0)
    ##  crayon        1.3.4      2017-09-16 [1] RSPM (R 4.0.0)
    ##  desc          1.2.0      2018-05-01 [1] RSPM (R 4.0.3)
    ##  devtools      2.3.2      2020-09-18 [1] RSPM (R 4.0.3)
    ##  digest        0.6.25     2020-02-23 [1] RSPM (R 4.0.0)
    ##  ellipsis      0.3.1      2020-05-15 [1] RSPM (R 4.0.3)
    ##  evaluate      0.14       2019-05-28 [1] RSPM (R 4.0.3)
    ##  fansi         0.4.1      2020-01-08 [1] RSPM (R 4.0.0)
    ##  fs            1.5.0      2020-07-31 [1] RSPM (R 4.0.3)
    ##  glue          1.6.1      2022-01-22 [1] CRAN (R 4.0.2)
    ##  htmltools     0.5.0      2020-06-16 [1] RSPM (R 4.0.1)
    ##  jquerylib     0.1.4      2021-04-26 [1] CRAN (R 4.0.2)
    ##  knitr         1.33       2022-02-15 [1] Github (yihui/knitr@a1052d1)
    ##  lifecycle     1.0.0      2021-02-15 [1] CRAN (R 4.0.2)
    ##  magrittr      2.0.2      2022-01-26 [1] CRAN (R 4.0.2)
    ##  memoise       1.1.0      2017-04-21 [1] RSPM (R 4.0.0)
    ##  pkgbuild      1.1.0      2020-07-13 [1] RSPM (R 4.0.2)
    ##  pkgload       1.1.0      2020-05-29 [1] RSPM (R 4.0.3)
    ##  prettyunits   1.1.1      2020-01-24 [1] RSPM (R 4.0.3)
    ##  processx      3.4.4      2020-09-03 [1] RSPM (R 4.0.2)
    ##  ps            1.3.4      2020-08-11 [1] RSPM (R 4.0.2)
    ##  purrr         0.3.4      2020-04-17 [1] RSPM (R 4.0.3)
    ##  R6            2.4.1      2019-11-12 [1] RSPM (R 4.0.0)
    ##  remotes       2.2.0      2020-07-21 [1] RSPM (R 4.0.3)
    ##  rlang         0.4.10     2022-02-15 [1] Github (r-lib/rlang@f0c9be5)
    ##  rmarkdown     2.10       2022-02-15 [1] Github (rstudio/rmarkdown@02d3c25)
    ##  rprojroot     2.0.2      2020-11-15 [1] CRAN (R 4.0.2)
    ##  sessioninfo   1.1.1      2018-11-05 [1] RSPM (R 4.0.3)
    ##  stringi       1.5.3      2020-09-09 [1] RSPM (R 4.0.3)
    ##  stringr       1.4.0      2019-02-10 [1] RSPM (R 4.0.3)
    ##  testthat      3.0.1      2022-02-15 [1] Github (R-lib/testthat@e99155a)
    ##  usethis       2.1.5.9000 2022-02-15 [1] Github (r-lib/usethis@57b109a)
    ##  withr         2.3.0      2020-09-22 [1] RSPM (R 4.0.2)
    ##  xfun          0.26       2022-02-15 [1] Github (yihui/xfun@74c2a66)
    ##  yaml          2.2.1      2020-02-01 [1] RSPM (R 4.0.3)
    ##
    ## [1] /usr/local/lib/R/site-library
    ## [2] /usr/local/lib/R/library
diff --git a/docs/using-notebooks.html b/docs/using-notebooks.html
index 3e607888..f6c3814d 100644
--- a/docs/using-notebooks.html
+++ b/docs/using-notebooks.html
@@ -347,14 +347,14 @@ Get the exercise project file
Get the Python project example files
Click this link to download.
-Now double click your chapter zip file to unzip. For Windows you may have to follow these instructions).
+Now double click your chapter zip file to unzip. For Windows you may have to follow these instructions.
Get the R project example files
Click this link to download.
-Now double click your chapter zip file to unzip. For Windows you may have to follow these instructions).
+Now double click your chapter zip file to unzip. For Windows you may have to follow these instructions.
@@ -512,13 +512,18 @@
References
Xie, Yihui, J. J. Allaire, and Garrett Grolemund. 2018.
R Markdown: The Definitive Guide. Boca Raton, Florida: Chapman; Hall/CRC.
https://bookdown.org/yihui/rmarkdown.
+
+
+
+
+
+
+
+
diff --git a/docs/writing-durable-code.html b/docs/writing-durable-code.html
index dc879b54..f0dae65b 100644
--- a/docs/writing-durable-code.html
+++ b/docs/writing-durable-code.html
@@ -421,7 +421,7 @@ Don’t be afraid to dele
Keeping around old code makes it harder for you to write and troubleshoot new better code – it’s easier to confuse yourself. Sometimes a fresh start can be what you need.
-With version control you can always return to that old code! (We’ll dive more into version control later on, but you’ve started the process by uploading your code to GitHub in chapter 4!)
+With version control you can always return to that old code! (We’ll dive more into version control later on, but you’ve started the process by uploading your code to GitHub in chapter 4!)
This means you should not comment out old code. Just delete it! No code is so precious that you need to keep it commented out (particularly if you are using version control and you can retrieve it in other ways should you need it).
Related to this, if you want to be certain that your code is reproducible, it’s worth deleting all your output, and re-running everything with a fresh session. The first step to knowing if your analysis is reproducible is seeing if you can repeat it yourself!
@@ -568,7 +568,7 @@ Python specific:
+
+
+
+
+
+
+
+