Merge pull request #59 from robmoss/section/project-structure

Add Project structure and Writing code sections
robmoss · Aug 7, 2024 · 7fdde03
2 parents a194431 + 6a63066

Showing 19 changed files with 607 additions and 6 deletions.
diff --git a/docs/guides/README.md b/docs/guides/README.md
@@ -8,8 +8,12 @@ These materials are divided into the following sections:
 
 3. Using Git to [collaborate with colleagues](./collaborating/README.md) in a precisely controlled and manageable way.
 
-4. Ensuring that your research is [reproducible by others](./reproducibility/README.md).
+4. Learning how to [structure your project](./project-structure/README.md) so that it is easier for you and others to navigate.
 
-5. Using [testing frameworks](./testing/README.md) to verify that your code behaves as intended, and to automatically detect when you introduce a bug or mistake into your code.
+5. Learning how to [write code](./writing-code/README.md) so that it clearly expresses your intent and ideas.
 
-6. Running your code on various [computing platforms]() that allow you to obtain results efficiently and without relying on your own laptop/computer.
+6. Ensuring that your research is [reproducible by others](./reproducibility/README.md).
+
+7. Using [testing frameworks](./testing/README.md) to verify that your code behaves as intended, and to automatically detect when you introduce a bug or mistake into your code.
+
+8. Running your code on various [computing platforms]() that allow you to obtain results efficiently and without relying on your own laptop/computer.
diff --git a/docs/guides/learning-objectives.md b/docs/guides/learning-objectives.md
@@ -51,3 +51,25 @@ After completing [this section](collaborating/README.md), you should be able to:
 - Use a pull request to **merge a collaborator's work** into your main branch; and
 
 - **Conduct peer code review** in a respectful manner.
+
+## Project structure
+
+After completing [this section](project-structure/README.md), you should be able to:
+
+- Understand how to structure a new project;
+
+- Understand how to separate "what to do" from "how to do it"; and
+
+- Structure your code to enable new experiments and analyses.
+
+## Writing code
+
+After completing [this section](writing-code/README.md), you should be able to:
+
+- Divide your code into functions and modules;
+
+- Ensure that your code is a clear expression of your ideas;
+
+- Structure your code into reusable packages; and
+
+- Take advantage of code formatters and code linters.
diff --git a/docs/guides/project-structure/README.md b/docs/guides/project-structure/README.md
@@ -0,0 +1,16 @@
+# Project structure
+
+How we choose to structure a project can affect how readily someone else (or even you, after a period of absence) can understand, use, and extend the work.
+
+!!! question
+
+    Have you ever looked at your old code and wondered how it worked or how to make it run?
+
+!!! tip
+
+    A good project structure can serve as a table of contents and help the reader to navigate.
+
+In an earlier section we provided some guidelines for [how to structure a repository](../using-git/how-to-structure-a-repository.md).
+In this section we present further guidelines and examples to help you choose a sensible structure for your current project and future projects.
+
+This includes high-level recommendations that should apply to any project, and more detailed recommendations that may be specific to a particular type of project or choice of programming language.
diff --git a/docs/guides/project-structure/automating-tasks.md b/docs/guides/project-structure/automating-tasks.md
@@ -0,0 +1,25 @@
+# Automate common tasks
+
+If you reach the point where you need to run a specific sequence of commands or actions to achieve something — e.g., running a model simulation, or producing an output figure — it is a **very good idea** to write a script that performs all of these actions correctly.
+
+This is because while you may remember exactly what needs to be done **right now**, you may not remember next week, or next month, or next year.
+We're all human, and we all make mistakes, but these kinds of mistakes are **easy to avoid**!
+
+!!! info
+
+    Mistakes are a fact of life. It is the response to the error that counts.
+
+    — Nikki Giovanni
+
+There are many tools that can help you to automate tasks, and some are smart enough to do only the work that is needed (e.g., they skip steps whose inputs have not changed).
+
+There are popular tools aimed at specific programming languages, such as:
+
+- **R:** [targets](https://books.ropensci.org/targets/);
+
+- **Python:** [nox](https://nox.thea.codes/) and [tox](https://tox.wiki/); and
+
+- **Julia:** [pipelines](https://juliapackages.com/p/pipelines).
+
+There are many generic automation tools (see, e.g., Wikipedia's list of [build automation software](https://en.wikipedia.org/wiki/List_of_build_automation_software)), although these can be rather complex to learn.
+We recommend using a language-specific automation tool where possible, and only using a generic automation tool as a last resort.
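The idea of a script that skips steps whose inputs have not changed can be sketched with the Python standard library alone. This is a minimal illustration, not a substitute for the tools listed above: the function names `is_up_to_date` and `run_step` are hypothetical, and comparing file modification times is an assumption about how "unchanged" is decided.

```python
from pathlib import Path


def is_up_to_date(inputs, output):
    """Return True if `output` exists and is at least as new as every input."""
    output = Path(output)
    if not output.exists():
        return False
    output_time = output.stat().st_mtime
    return all(Path(path).stat().st_mtime <= output_time for path in inputs)


def run_step(name, inputs, output, action):
    """Run `action` only if `output` is missing or older than an input.

    Returns True if the action was run, and False if it was skipped.
    """
    if is_up_to_date(inputs, output):
        print(f"{name}: up to date, skipping")
        return False
    print(f"{name}: running")
    action()
    return True
```

A project script could then call `run_step` once per task (clean the data, fit the model, plot the figures), passing the files that each task reads and the file that it writes.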
diff --git a/docs/guides/project-structure/exercise-a-good-readme.md b/docs/guides/project-structure/exercise-a-good-readme.md
@@ -0,0 +1,15 @@
+# Exercise: a good README
+
+Remember that the README file (usually one of `README.md`, `README.rst`, or `README.txt`) is often **the very first thing** that a user will see when looking at your project.
+
+- Have you seen any README files that were particularly helpful, or were not very helpful?
+
+- What information do you find helpful in a README file?
+
+Consider the `README.md` file in the [Australian 2020 COVID-19 forecasts repository](https://gitlab.unimelb.edu.au/rgmoss/aus-2020-covid-forecasts).
+
+- What content, if any, would you **add** to this file?
+
+- What content, if any, would you **remove** from this file?
+
+- Would you change its structure in any way?
diff --git a/docs/guides/project-structure/exercise-what-works-for-you.md b/docs/guides/project-structure/exercise-what-works-for-you.md
@@ -0,0 +1,9 @@
+# Exercise: what works for you?
+
+Look back at your past projects and identify aspects of their structure that you have found helpful.
+
+- What features or choices have worked well in past projects and might help you structure your future projects?
+
+- What problems or issues have you experienced with the structure of your past projects, which you could avoid in your future projects?
+
+- Can any of your colleagues and collaborators share similar insights?
diff --git a/docs/guides/project-structure/explain-how-it-works.md b/docs/guides/project-structure/explain-how-it-works.md
@@ -0,0 +1,46 @@
+# Explain how it all works
+
+Once you've chosen a project structure, you need to write down **how it all works** — regardless of how simple and clear your project structure is!
+
+!!! tip
+
+    The best place to do this is in a `README.md` file (or equivalent) in the project root directory.
+
+Begin with an overview of the project:
+
+- What question(s) are you trying to address?
+
+- What data, hypotheses, methods, etc., are you using?
+
+- What outputs does this generate?
+
+You can then provide further detail, such as:
+
+- What software environment and/or packages must be available for your code to run?
+
+- How can the user generate each of the outputs?
+
+- What license [have you chosen](../using-git/choosing-a-license.md)?
+
+
+## An example README.md
+
+See the [Australian 2020 COVID-19 forecasts repository](https://gitlab.unimelb.edu.au/rgmoss/aus-2020-covid-forecasts) for an example `README.md` file.
+
+This repository was used to generate the results, tables, and figures presented in the paper "[Forecasting COVID-19 activity in Australia to support pandemic response: May to October 2020](https://doi.org/10.1038/s41598-023-35668-6)", *Scientific Reports* 13, 8763 (2023).
+
+**Strengths:**
+
+- It includes installation and usage instructions;
+
+- It identifies the paper; and
+
+- It identifies the license under which the code is distributed.
+
+**Weaknesses:**
+
+- It only explains some of the project structure;
+
+- It doesn't provide an overview of the project and only links to the paper; and
+
+- The root directory contains a number of scripts and input files that aren't described.
diff --git a/docs/guides/project-structure/workflow.md b/docs/guides/project-structure/workflow.md
@@ -0,0 +1,70 @@
+# Define your workflow
+
+A good first step in deciding how to structure a project is to ask yourself:
+
+- What are the different project phases?
+
+- What are the major activities in each phase?
+
+## An example of phases and activities
+
+For example, a project might involve the following phases:
+
+1. Clean an existing data set;
+
+2. Build models with different hypotheses or features;
+
+3. Fit each model to the data; and
+
+4. Decide which model best explains the data.
+
+The data-cleaning phase might involve the following activities:
+
+- Obtain the raw data;
+
+- Identify the quality checks that should be applied;
+
+- Decide how to resolve data that fail each quality check; and
+
+- Generate and record the cleaned data.
+
+The model-building phase might involve the following activities:
+
+- Perform a literature search to identify relevant modelling studies;
+
+- Identify competing hypotheses or features that might explain the data;
+
+- Design a model that implements each hypothesis; and
+
+- Define the relationship between each model and the cleaned data.
+
+## Reflect this workflow in your project structure
+
+You can use the phases and activities to guide your choice of directory structure.
+For this example project, one possible structure is:
+
+- `project/`: the root directory of your project
+
+    - `input/`: a sub-directory that contains input data;
+
+        - `raw/`: the raw data **before** cleaning;
+
+        - `cleaned/`: the cleaned data;
+
+    - `code/`: a sub-directory that contains the project code;
+
+        - `cleaning/`: the data cleaning code;
+
+        - `model-first-hypothesis/`: the first model;
+
+        - `model-second-hypothesis/`: the second model;
+
+        - `fitting/`: the code that fits each model to the data;
+
+        - `evaluation/`: the code that compares the model fits;
+
+        - `plotting/`: the code that plots output figures;
+
+    - `paper/`: a sub-directory for the project manuscript;
+
+        - `figures/`: the output figures;
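If you want to create this layout consistently for each new project, one option is a short script. The sketch below uses only `pathlib`; the directory names are taken from the example structure above, while the names `SUBDIRS` and `create_layout` are hypothetical.

```python
from pathlib import Path

# Sub-directories from the example structure, relative to the project root.
SUBDIRS = [
    "input/raw",
    "input/cleaned",
    "code/cleaning",
    "code/model-first-hypothesis",
    "code/model-second-hypothesis",
    "code/fitting",
    "code/evaluation",
    "code/plotting",
    "paper/figures",
]


def create_layout(root):
    """Create the example directory structure under `root`.

    Existing directories are left untouched. Returns the list of paths.
    """
    root = Path(root)
    paths = []
    for sub in SUBDIRS:
        path = root / sub
        path.mkdir(parents=True, exist_ok=True)
        paths.append(path)
    return paths
```

Calling `create_layout("project")` creates the directories relative to the current working directory, so it does not depend on where the project lives on a particular computer.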
diff --git a/docs/guides/writing-code/README.md b/docs/guides/writing-code/README.md
@@ -0,0 +1,13 @@
+# Writing code
+
+For computational research, code is an important scientific artefact for the author, for colleagues and collaborators, and for the scientific community.
+It is the **ultimate form** of expressing **what you did** and **how you did it**.
+With good version control and documentation practices, it can also capture **when and why** you made important decisions.
+
+!!! tip
+
+    [W]e want to establish the idea that a computer language is not just a way of getting a computer to perform operations but rather that it is a novel formal medium for **expressing ideas about methodology**.
+    Thus, programs must be **written for people to read**, and only incidentally for machines to execute.
+
+    — [Structure and Interpretation of Computer Programs](https://mitpress.mit.edu/9780262510875/).
+    Abelson, Sussman, and Sussman, 1984.
diff --git a/docs/guides/writing-code/behave-nicely.md b/docs/guides/writing-code/behave-nicely.md
@@ -0,0 +1,42 @@
+# Behave nicely
+
+Would you feel comfortable running someone else's code if you thought it might affect your other files, applications, or settings, or do something else that's unexpected?
+
+!!! tip
+
+    Your code should be **encapsulated:** it should assume as little as possible about the computer on which it is running, and it shouldn't mess with the user's environment.
+
+!!! tip
+
+    Your code should follow the **principle of least surprise:** behave in a way that most users will expect it to behave, and not astonish or surprise them.
+
+## A cake analogy
+
+Suppose you have two colleagues who regularly bake cakes, and you decide you'd like one of them to bake you a chocolate cake.
+
+- **A nice colleague:**
+
+    - That evening, they go home and bake a cake.
+    - They bring the cake to work the next day.
+    - The cake tastes of chocolate.
+
+- **A messy colleague:**
+
+    - They bring the ingredients and a portable oven into your office.
+    - They make a huge mess, splattering your desk and computer.
+    - The oven is noisy and makes the office uncomfortably warm.
+    - The cake tastes of vanilla, not chocolate.
+
+## Some specific tips
+
+- Avoid modifying files outside of the project directory!
+
+- Avoid using hard-coded absolute paths, such as `C:\Users\My Name\Some Project\...` or `/Users/My Name/Some other directory`.
+  These make it harder for other people to use the code, or to run the code on high-performance computing platforms.
+
+- Prefer using paths that are relative to the root directory of your project, such as `input-data/case-data/cases-for-2023.csv`.
+  If you're using R, the [here](https://here.r-lib.org/) package is extremely helpful.
+
+- Warn the user before running tasks that take a long time to complete.
+
+- Notify the user before downloading large files.
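For Python projects, the tips about relative paths and about warning the user can be sketched as follows. This is one possible approach under stated assumptions: the helper names (`find_project_root`, `project_path`, `confirm`) are hypothetical, and using a `.git` directory to mark the project root is an assumption that may not suit every project.

```python
from pathlib import Path


def find_project_root(start, marker=".git"):
    """Walk upwards from `start` until a directory containing `marker` is found."""
    start = Path(start).resolve()
    for directory in [start, *start.parents]:
        if (directory / marker).exists():
            return directory
    raise FileNotFoundError(f"No directory containing {marker!r} above {start}")


def project_path(root, *parts):
    """Build a path from components relative to the project root."""
    return Path(root).joinpath(*parts)


def confirm(prompt, ask=input):
    """Ask a yes/no question; only 'y' or 'yes' counts as consent."""
    reply = ask(f"{prompt} [y/N] ")
    return reply.strip().lower() in {"y", "yes"}
```

A script might call `project_path(root, "input-data", "case-data", "cases-for-2023.csv")` instead of hard-coding an absolute path, and call `confirm("Download the full data set?")` before starting a large download or a long-running task.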
diff --git a/docs/guides/writing-code/check-your-code.md b/docs/guides/writing-code/check-your-code.md
@@ -0,0 +1,14 @@
+# Check your code
+
+A "linter" is a tool that checks your code for syntax errors, possible mistakes, inconsistent formatting, and other potential issues.
+
+We **strongly recommend** using an editor that displays linter warnings as you write your code.
+Having instant feedback allows you to rapidly resolve many common issues and substantially improve your code.
+
+We list here some of the most commonly used linters:
+
+- **R:** [lintr](https://lintr.r-lib.org/)
+
+- **Python:** [ruff](https://docs.astral.sh/ruff/)
+
+- **Julia:** [Lint.jl](https://lintjl.readthedocs.org/en/stable/)
diff --git a/docs/guides/writing-code/coding-advice.md b/docs/guides/writing-code/coding-advice.md
@@ -0,0 +1,30 @@
+# Coding advice
+
+- Think about how to cleanly structure your code.
+  Take a **similar approach to how we write papers and grants**.
+
+- Break the overall problem into pieces, and then decide how to structure each piece in turn.
+
+- Divide your code into functions that each do one "thing", and group related functions into separate files or modules.
+
+- It can sometimes help to think about how you want the final code to look, and then design the functions and components that are needed.
+
+- Avoid global variables and aim to pass everything as function arguments.
+  This makes the code more robust and easier to run.
+
+- Avoid passing lots of individual parameters as separate arguments.
+  This is prone to error, because you might not pass them in the correct order.
+  Instead, collect the parameters into a single structure (e.g., a Python dictionary or an R named list).
+
+- Avoid making multiple copies of a model if you want to change some aspect of its behaviour.
+  Instead, add a new model parameter that enables/disables this new behaviour.
+  This allows you to use the same code to run the older and newer versions of the model.
+
+- Try to collect common or related tasks into a single script, and allow the user to select which task(s) to run, rather than creating many scripts that perform very similar tasks.
+
+- Write test cases to check key model properties.
+
+    - You want to identify problems and mistakes as soon as possible!
+
+    - Thinking about how to make your code testable can help you improve its structure!
+
+    - Well-written tests can also demonstrate **how to use your code**!
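Several of these points can be illustrated together in a short Python sketch. Everything here is hypothetical: the growth model, the `Params` fields, and the test names are invented for illustration, and a dataclass is used as one of several reasonable ways to collect parameters into a single structure.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Params:
    """Hypothetical parameters for a discrete-time growth model."""

    growth_rate: float = 0.5
    initial_size: float = 10.0
    capacity: float = 1000.0
    # New behaviour is switched on by a parameter, not by copying the model.
    limit_growth: bool = False


def step(size, params):
    """Advance the model by one time step."""
    growth = params.growth_rate * size
    if params.limit_growth:
        # Logistic growth: slow down as the size approaches the capacity.
        growth *= 1 - size / params.capacity
    return size + growth


def simulate(params, num_steps):
    """Return the model trajectory, including the initial state."""
    sizes = [params.initial_size]
    for _ in range(num_steps):
        sizes.append(step(sizes[-1], params))
    return sizes


def test_limited_growth_stays_below_capacity():
    params = replace(Params(), limit_growth=True)
    assert max(simulate(params, 50)) <= params.capacity


def test_unlimited_growth_matches_closed_form():
    params = Params()
    expected = params.initial_size * (1 + params.growth_rate) ** 5
    assert abs(simulate(params, 5)[-1] - expected) < 1e-9
```

The same `simulate` function runs both the older and newer versions of the model, and the two test functions double as small usage examples. A test runner such as pytest would collect functions named `test_*` automatically.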
diff --git a/docs/guides/writing-code/cohesion-coupling.md b/docs/guides/writing-code/cohesion-coupling.md
@@ -0,0 +1,58 @@
+# Cohesion and coupling
+
+**Divide your code** into modules, each of which does one thing ("high cohesion") and depends as little as possible on other pieces ("low coupling").
+
+## Common project components
+
+For example, an infectious diseases modelling project might be divided into some of the following components:
+
+- The model parameters — what are their values or prior distributions?
+
+- The initial model state — how is this created from the model parameters?
+
+- The model equations or update rules — how does the model evolve over time?
+
+- Summary statistics — what do you want to record for each simulation?
+  This might be the entire state history, a subset of the history, some aggregate statistics, or any combination of these things.
+
+- The input data (if any) — these may be case data, serological data, within-host specimen counts, etc.
+
+- The relationship between data and the model state ("observation model").
+
+- Simulated data generated from a model simulation.
+
+As much as possible, each of these components (where relevant to your project) should be represented as **a separate piece of code**.
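As a rough illustration of this separation, the sketch below gives each component of a simple discrete-time SIR model its own function. The model, the parameter values, and all function names are hypothetical choices made for this example.

```python
def default_params():
    """Model parameters (hypothetical values)."""
    return {"beta": 0.3, "gamma": 0.1, "population": 1000, "initial_infected": 1}


def initial_state(params):
    """Create the initial model state from the model parameters."""
    infected = params["initial_infected"]
    return {"S": params["population"] - infected, "I": infected, "R": 0}


def update(state, params):
    """Update rule: advance the model state by one day."""
    new_infections = params["beta"] * state["S"] * state["I"] / params["population"]
    new_recoveries = params["gamma"] * state["I"]
    return {
        "S": state["S"] - new_infections,
        "I": state["I"] + new_infections - new_recoveries,
        "R": state["R"] + new_recoveries,
    }


def run(params, num_days):
    """Return the full state history, including the initial state."""
    history = [initial_state(params)]
    for _ in range(num_days):
        history.append(update(history[-1], params))
    return history


def peak_infected(history):
    """Summary statistic: the largest number of infected individuals."""
    return max(state["I"] for state in history)
```

Because each piece is separate, you could replace `update` with a different set of model equations, or add a new summary statistic, without touching the other functions.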
+
+## Separating the "what" from the "how"
+
+Dividing your code into separate components is especially important if you want to use a model for multiple purposes, such as:
+
+- Exploring different scenarios;
+- Fitting to various data sets;
+- Performing sensitivity and uncertainty analyses; and
+- Forecasting future data.
+
+!!! tip
+
+    In particular, keep the following aspects of your project separate:
+
+    - **What to do:** fitting to different data sets, exploring different scenarios, performing a sensitivity analysis, etc.; and
+
+    - **How to do it:** the model implementation.
+
+    If you want to explore a range of model scenarios, for example, define the parameter values (or sampling distributions) for each scenario in a separate input file.
+    Then write a script that takes an input file name as an argument, reads the parameter values, and uses these values to run the model simulations.
+
+    This makes it extremely simple to define and run new scenarios without modifying your code.
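The script described in the tip above might look something like the following sketch. It assumes that each scenario is stored as a JSON file, which is only one possible choice of format; the parameter names and the `run_model` stand-in are hypothetical.

```python
import argparse
import json
from pathlib import Path


def load_scenario(path):
    """Read the parameter values for one scenario from a JSON file."""
    return json.loads(Path(path).read_text())


def run_model(params):
    """Stand-in for the model implementation ("how to do it")."""
    size = params["initial_size"]
    for _ in range(params["num_steps"]):
        size *= 1 + params["growth_rate"]
    return size


def main(argv=None):
    """Run the scenario named on the command line ("what to do")."""
    parser = argparse.ArgumentParser(description="Run one model scenario.")
    parser.add_argument("scenario_file", help="JSON file of parameter values")
    args = parser.parse_args(argv)
    result = run_model(load_scenario(args.scenario_file))
    print(f"{args.scenario_file}: final size = {result}")
    return result
```

To use this as a command-line script, call `main()` from an `if __name__ == "__main__":` guard at the end of the file. Defining a new scenario then only requires a new input file, such as one containing `{"initial_size": 10, "num_steps": 2, "growth_rate": 0.5}`.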
+
+## Interactions between components
+
+Choosing how your components interact (e.g., by calling functions or passing data) is **just as important** as deciding how to divide your code into components.
+
+Here are some key recommendations from [Object-Oriented Software Construction (2nd ed)](https://bertrandmeyer.com/OOSC2/):
+
+- Small interfaces: if two modules communicate, they should exchange as little information as possible.
+
+- Explicit interfaces: if two modules communicate, it should be obvious from the code in one or both of these modules.
+
+- Self documentation: strive to make all information about a module part of the module itself.
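These recommendations can be illustrated with a small sketch in which the fitting code and the model share only one thing: a function that maps a growth rate to a sequence of values. All names here are hypothetical, and choosing the best candidate by sum of squared errors is an assumption made to keep the example short.

```python
def sum_squared_error(simulated, observed):
    """Return the sum of squared differences between two equal-length sequences."""
    if len(simulated) != len(observed):
        raise ValueError("simulated and observed must have the same length")
    return sum((s - o) ** 2 for s, o in zip(simulated, observed))


def best_growth_rate(simulate, observed, candidates):
    """Return the candidate growth rate whose simulation best matches `observed`.

    `simulate` is any function that maps a growth rate to a sequence of
    values; this is the only thing the fitting code knows about the model.
    """
    return min(
        candidates,
        key=lambda rate: sum_squared_error(simulate(rate), observed),
    )


def exponential_growth(rate, initial_size=10.0, num_steps=3):
    """A hypothetical model: return the sizes at steps 0 to `num_steps`."""
    return [initial_size * (1 + rate) ** step for step in range(num_steps + 1)]
```

The interface is small (one callable and two sequences), explicit (the dependency appears in the argument list of `best_growth_rate`), and documented in the docstrings of the module itself.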