Haylee Ham edited this page Dec 15, 2020 · 42 revisions

LISH Lab Manual

Computing is an essential part of the research we do at LISH. Good computing practices therefore help us to meet the scientific standards that we aspire to. However, project management can be a complex and daunting process, sometimes seeming to require costly investment in learning new tools with vastly more functionality than the typical project requires.

This iteration of the LISH Lab Manual recognizes that tension and focuses on "Good Enough Practices in Scientific Computing". The goal is to enumerate practices and standards that any researcher can adopt with minimal difficulty and that pay off immediately for you and your collaborators (especially your future self!).

Change, big or small, can be a pain, especially when it comes to something as personal as one's workflow. This is why you'll notice that most, if not all, of the guidelines listed here do not require you to learn any new computing tools. They are focused on the organization of projects - like making sure that items are clearly named and placed where you'd expect to find them. While part of the goal here is to make life easier for the people who use your code or data, hopefully you'll find that this makes your life and research a little bit easier too 🙂

Projects

A project is a mid-to-long-term endeavor, such as writing a research paper or building a substantial new dataset. It does not include exploratory analysis or work that is expected to have a short lifespan. Each project has one GitHub repository, and each repository holds one project. Follow these guidelines when creating a new project repository.

  1. When creating a new repo, default to making the repository private. You may be working with sensitive data or code that is not yet ready for public consumption. You may make the repo public later on. Note that when you create a private repo, members of the LISH organization can still clone, pull, and commit to your repo.
  2. Each project repository is required to have a README.md. A README needs to have basic information, including the purpose of the project, how to run the code, and where to find the data. You can use this template.

Folder Structure

The following outlines a basic folder structure for all projects.

  1. If a project exists on your computer, it should have its own directory
  2. Every project should have a folder for:
    • data, with subdirectories for raw data and intermediate data (e.g. intermediate, processed)
    • code - e.g. scripts, src, code, R, stata
    • results such as figures or tables that you may include in your documents
    • documents such as manuscripts, notebooks, or presentations
  3. Use additional subdirectories as needed to organize your data or results (e.g. you might want to organize raw data based on time of collection)
  4. The repository should also include a .gitignore file. This file will list file names, folder names, or extensions that you do not want to push to GitHub. This is great for excluding data, since we never want to store large amounts (or really any amount) of data on GitHub. It is also a great way to avoid pushing sensitive information, such as .env files. Read more about .gitignore files here.
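The layout above can be set up in one go. Here is a minimal sketch in Python, assuming the folder names `data/raw`, `data/intermediate`, `code`, `results`, and `documents`; the function name `create_project_skeleton` and the starter `.gitignore` contents are illustrative choices, not a LISH standard.

```python
from pathlib import Path

# Folder names follow the list above; adjust them to suit your project.
PROJECT_DIRS = [
    "data/raw",
    "data/intermediate",
    "code",
    "results",
    "documents",
]


def create_project_skeleton(root):
    """Create the standard folders under `root` and return their paths."""
    root = Path(root)
    created = []
    for relative_dir in PROJECT_DIRS:
        path = root / relative_dir
        # parents=True also creates `root` and `data`;
        # exist_ok=True makes it safe to run the function again.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)

    # Write a starter .gitignore (hypothetical contents) unless one exists.
    gitignore = root / ".gitignore"
    if not gitignore.exists():
        gitignore.write_text("data/\n.env\n")
    return created
```

Calling `create_project_skeleton("my-project")` from the directory where you keep your projects creates the folders and a `.gitignore` that excludes `data/` and `.env`. Existing folders and an existing `.gitignore` are left untouched.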

Code

  1. Start every script with a brief comment explaining what it does
  2. Use meaningful names for variables and functions, and abbreviations only when you are confident a reader will understand them (e.g. income_percapita or income_percap but not income_pc)
  3. It's not always possible to write code that's easily readable. In such situations, write succinct comments. The goal is that you should be able to understand your code if you came back to it after 6 months of not looking at it
  4. Use relative and not absolute directory paths
  5. Write modular code - decompose your code into functions or scripts that have clearly-defined inputs and outputs. Rule of thumb: keep each script to 100-200 lines and if it is longer, ask yourself if it can be broken down into more digestible chunks
  6. Write unit tests to do automatic checks that your code is doing what you think it is
  7. Make dependencies and requirements explicit
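To make these guidelines concrete, here is a small sketch of what a script following them might look like in Python. The file names (`households.csv`, `income_percapita.csv`) and the function are hypothetical and only illustrate the header comment, meaningful names, relative paths, a function with clear inputs and outputs, and a unit test.

```python
"""Compute per-capita income from household totals (illustrative script).

Input:  data/raw/households.csv (hypothetical file)
Output: data/intermediate/income_percapita.csv (hypothetical file)
"""
from pathlib import Path

# Relative paths, meant to be resolved from the project root, so the
# script runs unchanged on any collaborator's machine.
RAW_FILE = Path("data") / "raw" / "households.csv"
OUTPUT_FILE = Path("data") / "intermediate" / "income_percapita.csv"


def income_percapita(household_income, household_size):
    """Return income per household member.

    Raises ValueError for a non-positive household size instead of
    returning a misleading number.
    """
    if household_size <= 0:
        raise ValueError("household_size must be positive")
    return household_income / household_size


def test_income_percapita():
    """Unit test; a runner such as pytest collects test_* functions from test files."""
    assert income_percapita(90_000, 3) == 30_000
    try:
        income_percapita(90_000, 0)
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError for household_size=0")
```

In a real project the test would usually live in its own file (for example under a `tests` folder), but keeping it next to the function is fine for a short script.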

Here are some basic language-specific guidelines.

Python

  • The repository should include a requirements.txt file. This file will list all of the libraries that are required to run the code in your repo. You should also pin the version of the package you are using. For example, python-dateutil==2.8.1. For more details on writing and installing from a requirements.txt file, see this page.
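If you want a quick check that every entry is pinned, something like the following sketch could help. It assumes a simple requirements.txt made of `name==version` lines plus comments; the helper `find_unpinned` is hypothetical and does not handle the full requirements file format (URLs, environment markers, and so on).

```python
def find_unpinned(requirement_lines):
    """Return the requirement lines that lack an exact `==` version pin."""
    unpinned = []
    for line in requirement_lines:
        # Drop inline comments and surrounding whitespace.
        requirement = line.split("#", 1)[0].strip()
        if not requirement:
            continue  # blank line or comment-only line
        if requirement.startswith("-"):
            continue  # pip option such as "-r other.txt", not a package
        if "==" not in requirement:
            unpinned.append(requirement)
    return unpinned
```

For example, `find_unpinned(["python-dateutil==2.8.1", "pandas"])` returns `["pandas"]`, which tells you that pandas still needs a version. To check a real file, pass in `Path("requirements.txt").read_text().splitlines()` after importing `Path` from `pathlib`.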

R

  • If the primary language for the project is R, use RStudio Projects to keep your project self-contained (RStudio Projects and the here package make it easy to work with relative directories in R)
  • Load packages at the very beginning of your workflow with a clearly named script (e.g. 01_load-packages.R)

Stata

  • For additional useful coding tips and helpful links, click here for Julian Reif's Stata Coding Guide.

File Names

  • Machine-readable

    • Avoid spaces, punctuation, accented characters. Don't use upper/lower case to distinguish file names (e.g. "Foo" and "foo")
    • Underscore "_" to delimit units of metadata and hyphen "-" to delimit words so your eyes don't bleed
    • Good: 2020-03-08_nasa-jet-engine-contest.csv, 01_load-flight-data.R; Bad: weiyang should not use spaces.png, thesewordsneedtobeseparated.ipynb
  • Human-readable

    • Easy to figure out what something is just by looking at its name
    • Good: 01_remove-duplicates.R, 02_standardize-last-names.R; Bad: 01_clean-data.R, 02_clean-data.R
  • Plays well with default ordering

    • Put something numeric first - either date or logical ordering
    • If dates, use YYYY-MM-DD (ISO 8601 standard): 2020-03-08_nasa-jet-engine-contest.csv, not 03-08-2020.nasa-jet-engine-contest.csv
    • If non-date number, left pad with zeros, e.g. 04_filter-europe.R not 4_filter-europe.R. Otherwise a plain alphabetical sort will arrange the files as 1.R, 10.R, 2.R instead of 01.R, 02.R, ..., 10.R

Inspired by Jenny Bryan's "How to name files" slides
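The naming rules above are easy to automate. This is a minimal sketch in Python; the helpers `slugify`, `dated_name`, and `numbered_name` are hypothetical names made up for illustration, and the two-digit padding is just a default.

```python
import re
from datetime import date


def slugify(words):
    """Lower-case `words` and join the pieces with hyphens.

    Spaces, punctuation, and non-ASCII characters are dropped
    (accented letters are removed, not transliterated).
    """
    tokens = re.findall(r"[a-z0-9]+", words.lower())
    return "-".join(tokens)


def dated_name(day, description, extension):
    """Build a name like 2020-03-08_some-description.csv (ISO 8601 date first)."""
    return f"{day.isoformat()}_{slugify(description)}.{extension}"


def numbered_name(step, description, extension, width=2):
    """Build a name like 04_some-description.R, left-padding the step with zeros."""
    return f"{step:0{width}d}_{slugify(description)}.{extension}"
```

For example, `dated_name(date(2020, 3, 8), "NASA jet engine contest", "csv")` gives `2020-03-08_nasa-jet-engine-contest.csv`, and `numbered_name(4, "filter Europe", "R")` gives `04_filter-europe.R`. Because the step is zero-padded, a plain `sorted()` over the resulting names keeps step 2 ahead of step 10.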

Wanna get fancy?

If you'd like to explore the use of more sophisticated tools to manage your code and data, feel free to do so! These aren't part of the standard requirements at LISH yet, but you are more than welcome to pick them up and teach them to the rest of us! Just remember to make it clear in your README how they fit into your workflow.

Here are some tools you might wanna explore:

Data

GitHub does not allow tracking of files over 100MB. In most cases, all data should be stored externally and the project README should include a detailed description of how to find and access the data for the project.
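Before committing, it can be handy to check whether anything in your project folder is over the limit. Here is a minimal sketch, assuming the 100MB limit is counted as 100 * 1024 * 1024 bytes; `find_large_files` is a hypothetical helper, not part of Git or GitHub.

```python
from pathlib import Path

# Assumption: the 100MB limit mentioned above, counted in binary megabytes.
GITHUB_FILE_LIMIT_BYTES = 100 * 1024 * 1024


def find_large_files(root, limit_bytes=GITHUB_FILE_LIMIT_BYTES):
    """Return (relative path, size in bytes) for files under `root` above the limit."""
    root = Path(root)
    large = []
    for path in root.rglob("*"):
        relative = path.relative_to(root)
        if ".git" in relative.parts:
            continue  # skip Git's own internal files
        if path.is_file():
            size = path.stat().st_size
            if size > limit_bytes:
                large.append((relative, size))
    return sorted(large)
```

Running `find_large_files(".")` from the project root lists the files that would need to live outside the repository. Anything it reports is a good candidate for your .gitignore.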
