diff --git a/pr-preview/pr-75/community/training/debugging/example-square-numbers/index.html b/pr-preview/pr-75/community/training/debugging/example-square-numbers/index.html index 6f8b1ffb..395d1057 100644 --- a/pr-preview/pr-75/community/training/debugging/example-square-numbers/index.html +++ b/pr-preview/pr-75/community/training/debugging/example-square-numbers/index.html @@ -3530,6 +3530,10 @@
Manual breakpoints
+You can also create breakpoints in your code by calling breakpoint()
for Python, and browser()
for R.
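For example, a minimal Python sketch (the function and data here are made up for illustration):

```py
def total_cases(daily_cases):
    # Execution pauses here and drops you into the pdb prompt,
    # where you can inspect local variables and step through the code.
    breakpoint()
    return sum(daily_cases)

total_cases([3, 1, 4])
```

Calling browser() at the equivalent point in an R function behaves in much the same way.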
Interactive debugger sessions
If your editor supports running a debugger, use this feature! diff --git a/pr-preview/pr-75/search/search_index.json b/pr-preview/pr-75/search/search_index.json index a1123ea2..1aa6ab5d 100644 --- a/pr-preview/pr-75/search/search_index.json +++ b/pr-preview/pr-75/search/search_index.json @@ -1 +1 @@ -{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"","title":"Introduction","text":"
These materials aim to support early- and mid-career researchers (EMCRs) in the SPECTRUM and SPARK networks to develop their computing skills, and to make effective use of available tools1 and infrastructure2.
"},{"location":"#structure","title":"Structure","text":"Start with the basics: Our orientation tutorials provide overviews of essential skills, tools, templates, and suggested workflows.
Learn more about best practices: Our topical guides explain a range of topics and provide exercises to test your understanding.
Come together as a community: Our Community of Practice is how we come together to share skills, knowledge, and experience.
"},{"location":"#motivation","title":"Motivation","text":"Question
Why dedicate time and effort to learning these skills? There are many reasons!
The overall aim of these materials is to help you conduct code-driven research more efficiently and with greater confidence.
Hopefully some of the following reasons resonate with you.
Fearlessly modify your code, knowing that your past work is never lost, by using version control with git.
Verify that your code behaves as expected, and get notified when it doesn't, by writing tests.
Ensure that your results won't change when running on a different computer by \"baking in\" reproducibility.
Improve your coding skills, and those of your colleagues, by working collaboratively and making use of peer code review.
Run your code quickly, and without relying on your own laptop or computer, by using high-performance computing.
Foundations of effective research
A piece of code is often useful beyond a single project or study.
By applying the above skills in your research, you will be able to easily reproduce past results, extend your code to address new questions and problems, and allow others to build on your code in their own research.
The benefits of good practices can continue to pay off long after the project is finished.
"},{"location":"#license","title":"License","text":"This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Such as version control and testing frameworks.\u00a0\u21a9
Such as the ARDC Nectar Research Cloud and Spartan.\u00a0\u21a9
Here is a list of the contributors who have helped develop these materials:
If you've made use of Git in your research activities, please let us know! We're looking for case studies that highlight how EMCRs are using Git. See the instructions for suggesting new content (below).
"},{"location":"how-to-contribute/#provide-comments-and-feedback","title":"Provide comments and feedback","text":"The easiest way to provide comments and feedback is to create an issue. Note that this requires a GitHub account. If you do not have a GitHub account, you can email any of the authors. Please include \"Git is my lab book\" in the subject line.
"},{"location":"how-to-contribute/#suggest-modifications-and-new-content","title":"Suggest modifications and new content","text":"This book is written in Markdown and is published using Material for MkDocs. See the Material for MkDocs Reference for an overview of the supported features.
You can suggest modifications and new content by:
Forking the book repository;
Adding, deleting, and/or modifying book chapters in the docs/
directory;
Recording your changes in one or more git commits; and
Creating a pull request, so that we can review your suggestions.
Info
You can also edit any page by clicking the \"Edit this page\" button in the top-right corner. This will start the process described above by forking the book repository.
Tip
When editing Markdown content, please start each sentence on a separate line. Also check that your text editor removes trailing whitespace.
This ensures that each commit will contain only the modified sentences, and makes it easier to inspect the repository history.
Tip
When you add a new page, you must also add the page to the nav
block in mkdocs.yml
.
You can display content in multiple tabs by using ===
. For example:
=== \"Python\"\n\n ```py\n print(\"Hello world\")\n ```\n\n=== \"R\"\n\n ```R\n cat(\"Hello world\\n\")\n ```\n\n=== \"C++\"\n\n ```cpp\n #include <iostream>\n\n int main() {\n std::cout << \"Hello World\";\n return 0;\n }\n ```\n\n=== \"Shell\"\n\n ```sh\n echo \"Hello world\"\n ```\n\n=== \"Rust\"\n\n ```rust\n fn main() {\n println!(\"Hello World\");\n }\n ```\n
produces:
PythonRC++ShellRustprint(\"Hello world\")\n
cat(\"Hello world\\n\")\n
#include <iostream>\n\nint main() {\n std::cout << \"Hello World\";\n return 0;\n}\n
echo \"Hello world\"\n
fn main() {\n println!(\"Hello World\");\n}\n
"},{"location":"how-to-contribute/#adding-terminal-session-recordings","title":"Adding terminal session recordings","text":"You can use asciinema to record a terminal session, and display this recorded session with a small amount of HTML and JavaScript. For example, the following code is used to display the where-did-this-line-come-from.cast
recording in a tab called \"Video demonstration\", as shown in the Where did this line come from? chapter:
=== \"Video demonstration\"\n\n <div id=\"demo\" data-cast-file=\"../where-did-this-line-come-from.cast\"></div>\n
You can also add links that jump to specific times in the video. Each link must have:
a data-video attribute that identifies the video (in the example above, this is \"demo\");
a data-seek-to attribute that identifies the time (in seconds) to jump to; and
an href attribute that is set to \"javascript:;\" (so that the link doesn't scroll the page).
For example, the following code is used to display the video recording on the Choosing your Git Editor page:
=== \"Git editor example\"\n\n <div id=\"demo\" data-cast-file=\"../git-editor-example.cast\"></div>\n\n Video timeline:\n\n 1. <a data-video=\"demo\" data-seek-to=\"4\" href=\"javascript:;\">Overview</a>\n 2. <a data-video=\"demo\" data-seek-to=\"17\" href=\"javascript:;\">Show how to use nano</a>\n 3. <a data-video=\"demo\" data-seek-to=\"71\" href=\"javascript:;\">Show how to use vim</a>\n
You can use the asciinema-scripted tool to generate scripted recordings.
"},{"location":"community/","title":"Community of Practice","text":"Info
Communities of Practice are groups of people who share a concern or a passion for something they do and learn how to do it better as they interact regularly.
The community acts as a living curriculum and involves learning on the part of everyone.
The aim of a Community of Practice (CoP) is to come together as a community and engage in a process of collective learning in a shared domain. The three characteristics of a CoP are:
Community: An environment for learning through interaction;
Practice: Specific knowledge shared by community members; and
Domain: A shared interest, problem, or concern.
We regularly meet as a community, report meeting summaries, and collect case studies that showcase good practices.
"},{"location":"community/#training-events","title":"Training events","text":"To support skill development, we have the capacity to prepare and deliver bespoke training events as standalone session and as part of larger meetings and conferences. See our Training events page for further details.
"},{"location":"community/case-studies/","title":"Case studies","text":"This section contains interesting and useful examples of incorporating Git into a research activity, as contributed by EMCRs in our network.
"},{"location":"community/case-studies/campbell-pen-and-paper-version-control/","title":"Pen and paper - a less user-friendly form of version control than Git","text":"Author: Trish Campbell (patricia.campbell@unimelb.edu.au
)
Project: Pertussis modelling
"},{"location":"community/case-studies/campbell-pen-and-paper-version-control/#the-problem","title":"The problem","text":"In this project, I developed a compartmental model of pertussis to determine appropriate vaccination strategies. While plotting some single model simulations, I noticed anomalies in the modelled output for two experiments. The first experiment had an order of magnitude more people in the infectious compartments than in the second experiment, even though there seemed to be far fewer infections occurring. This scenario did not fit with the parameter values that were being used. In the differential equation file for my model, in addition to extracting the state of the model (i.e., the population in each compartment at each time step), for ease of analysis I also extracted the cumulative number of infections up to that time step. The calculation for this extraction of cumulative incidence was incorrect.
"},{"location":"community/case-studies/campbell-pen-and-paper-version-control/#the-solution","title":"The solution","text":"The error occurred because susceptible people in my model were not all equally susceptible, and I failed to account for this when I calculated the cumulative number of infections at each time step. I identified that this was the problem by running some targeted test parameter sets and observing the changes in model output. The next step was to find out how long this bug had existed in the code and which analyses had been affected. While I was using version control, I tended to make large infrequent commits. I did, however, keep extensive hand-written notes in lab books, which played the role of a detailed history of commits. Searching through my historical lab books, I identified that I had introduced this bug into the code two years earlier. I was able to determine which parts of my results would have been affected by the bug and made the decision that all experiments needed to be re-run.
"},{"location":"community/case-studies/campbell-pen-and-paper-version-control/#how-version-control-helped","title":"How version control helped","text":"Using a pen and paper form of version control enabled me to pinpoint the introduction of the error and identify the affected analyses, but it was a tedious process. While keeping an immaculate record of changes that I had made was invaluable, imagine how much simpler and faster the process would have been if I had been a regular user of an electronic version control system such as Git!
"},{"location":"community/case-studies/moss-incorrect-data-pre-print/","title":"Incorrect data in a pre-print figure","text":"Author: Rob Moss (rgmoss@unimelb.edu.au
)
Project: COVID-19 scenario modelling (public repository)
"},{"location":"community/case-studies/moss-incorrect-data-pre-print/#the-problem","title":"The problem","text":"Our colleague James Trauer notified us that they suspected there was an error in Figure 2 of our COVID-19 scenario modelling pre-print article. This figure showed model predictions of the daily ICU admission demand in an unmitigated COVID-19 pandemic, and in a COVID-19 pandemic with case targeted public health measures. I inspected the script responsible for plotting this figure, and confirmed that I had mistakenly plotted the combined demand for ward and ICU beds, instead of the demand for ICU beds alone.
"},{"location":"community/case-studies/moss-incorrect-data-pre-print/#the-solution","title":"The solution","text":"This mistake was simple to correct, but the obvious concern was whether any other outputs related to ICU bed demand were affected.
We conducted a detailed review of all data analysis scripts and outputs, and confirmed that this error only affected this single manuscript figure. It had no bearing on the impact of the interventions in each model scenario. Importantly, it did not affect any of the simulation outputs, summary tables, and/or figures that were included in our reports to government.
The corrected figure can be seen in the published article.
"},{"location":"community/case-studies/moss-incorrect-data-pre-print/#how-version-control-helped","title":"How version control helped","text":"Because we used version control to record the development history of the model and all of the simulation analyses, we were able to easily inspect the repository state at the time of each prior analysis. This greatly simplified the review process, and ensured that we were inspecting the code exactly as it was when we produced each analysis.
"},{"location":"community/case-studies/moss-pypfilt-earlier-states/","title":"Fixing a bug in pypfilt","text":"Author: Rob Moss (rgmoss@unimelb.edu.au
)
Project: pypfilt, a bootstrap particle filter for Python
Date: 27 October 2021
"},{"location":"community/case-studies/moss-pypfilt-earlier-states/#overview","title":"Overview","text":"I introduced a bug when I modified a function in my pypfilt package, and only detected the bug after I had created several more commits.
To resolve this bug, I had to:
Notice the bug;
Identify the cause of the bug;
Write a test case to check whether the bug is present; and
Fix the bug.
I noticed that a regression test1 was failing: re-running a set of model simulations was no longer generating the same output. The results had changed, but none of my recent commits should have had this effect.
I should have noticed this when I created the commit that introduced this bug, but:
I had not pushed the most recent commits to the upstream repository, where all of the test cases are run automatically every time a new commit is pushed; and
I had not run the test cases on my laptop after making each of the recent commits, because this takes a few minutes and I was lazy.
I knew that the bug had been introduced quite recently, and I knew that it affected a specific function: earlier_states()
. Running git blame src/pypfilt/state.py
indicated that the recent commit 408b5f1
was a likely culprit, because it changed many lines in this function.
In particular, I suspected the bug was occurring in the following loop, which steps backwards in time and handles the case where model simulations are reordered:
# Start with the parent indices for the current particles, which allow us\n# to look back one time-step.\nparent_ixs = np.copy(hist['prev_ix'][ix])\n\n# Continue looking back one time-step, and only update the parent indices\n# at time-step T if the particles were resampled at time-step T+1.\nfor i in range(1, steps):\n step_ix = ix - i\n if hist['resampled'][step_ix + 1, 0]:\n parent_ixs = hist['prev_ix'][step_ix, parent_ixs]\n
In stepping through this code, I identified that the following line was incorrect:
if hist['resampled'][step_ix + 1, 0]:\n
and that changing step_ix + 1
to step_ix
should fix the bug.
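That is, the corrected condition becomes:
if hist['resampled'][step_ix, 0]:\n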
Note: I could have used git bisect
to identify the commit that introduced this bug, but running all of the test cases for each commit is relatively time-consuming; since I knew that the bug had been introduced quite recently, I chose to use git blame
.
I wrote a test case test_earlier_state()
that called this earlier_states()
function a number of times, and checked that each set of model simulations was returned in the correct order.
This test case checks that:
If the model simulations were not reordered, the original ordering is always returned;
If the model simulations were reordered at some time t_0
, the original ordering is returned for times t < t_0
; and
If the model simulations were reordered at some time t_0
, the new ordering is returned for times t >= t_0
.
This test case failed when I reran the testing pipeline, which indicated that it identified the bug.
"},{"location":"community/case-studies/moss-pypfilt-earlier-states/#fix-the-bug","title":"Fix the bug","text":"With the test case now written, I was able to verify that that changing step_ix + 1
to step_ix
did fix the bug.
I added the test case and the bug fix in commit 9dcf621
.
In the commit message I indicated:
Where the bug was located: the earlier_states()
function;
When the bug was introduced: commit 408b5f1
; and
Why the bug was not detected when I created commit 408b5f1
.
A regression test checks that a commit hasn't changed an existing behaviour or functionality.\u00a0\u21a9
This section contains summaries of each Community of Practice meeting.
8 August 2024: orientation guide planning.
11 July 2024: presentation from Nefel Tellioglu.
13 June 2024: presentation from Cam Zachreson.
9 May 2024: presentation from TK Le.
11 April 2024: ideas and resources for the orientation guide.
19 February 2024: identify goals and activities for 2024.
18 October 2023: sharing experiences about good ways to structure a project.
15 August 2023: changing our research and reproducibility practices.
13 June 2023: exploration of version control, reproducibility, and testing exercises.
17 April 2023: our initial meeting.
This is our initial meeting. The goal is to welcome people to the community and outline how we envision running these Community of Practice meetings.
"},{"location":"community/meetings/2023-04-17/#theme-reproducible-research","title":"Theme: Reproducible Research","text":"Outline the theme and scope for this community.
This is open to all researchers who share an interest in reproducible research and/or related topics and practices; no prior knowledge is required.
For example, consider these questions:
Can you reproduce your current results on a new computer?
Can someone else reproduce your current results?
Can someone else reproduce your current results without your help?
Can you reproduce your own results from, say, 2 years ago?
Can someone else reproduce your own results from, say, 2 years ago?
Can you fix a mistake and update your own results from, say, 2 years ago?
Tip
The biggest challenge can often be remembering what you did and how you did it.
Making small changes to your practices can greatly improve reproducibility!
"},{"location":"community/meetings/2023-04-17/#how-will-these-meetings-run","title":"How will these meetings run?","text":"Aim to hold these meetings on a (roughly) monthly basis.
Prior to each meeting, we will invite community members to propose a topic or discussion point to be the focus of the meeting. This may be an open question or challenge, an example of good research practices, a useful software tool, etc.
Schedule each meeting to best suit the availability of community members who are particularly interested in the chosen topic.
Each meeting should be hosted by one or more community members, with online participation available to those who cannot attend in person.
At the end of each meeting, we will ask attendees how useful/effective they found the meeting, so that we can better cater to the needs of the community. For example:
What do you think of the session?
What could we do better in the next session?
We will summarise the key observations, findings, and outputs of each meeting in our online materials, and use them to improve and grow our training materials.
Info
To function effectively as a community, we need to support asynchronous discussions in addition to scheduled meetings.
One option is a dedicated mailing list. Other options were suggested:
A Slack workspace (Dave);
A Discord channel (TK);
A Teams channel (Gerry); and
A private GitHub repository, using the issue tracker (Alex).
Using a GitHub issue tracker might also serve as a gentle introduction to GitHub?
"},{"location":"community/meetings/2023-04-17/#supporting-activities-and-resources","title":"Supporting activities and resources?","text":"Are there other activities that we could organise to help support the community?
We have online training materials, Git is my lab book, which should be useful for people who are not familiar with version control.
We also have a SPECTRUM/SPARK peer review team, where people can make their code available for peer review.
We asked each participant to suggest topics that they would like to see covered in future meetings and/or activities. A number of common themes emerged.
"},{"location":"community/meetings/2023-04-17/#version-control-from-theory-to-practice","title":"Version control: from theory to practice","text":"A number of people mentioned now being sure how to get started, or starting with good intentions but ending up with a mess.
Dave: how can I transition from principle to practice?
Ollie: similar to David, I often start well but end up with a mess.
Ruarai: what others have found useful and applied in this space, what options are out there?
Michael: I'm a complete novice, git command lines are a foreign language to me! I'm looking for tips for someone who uses code a lot, experienced at coding but much less so on version control and the use of repositories. What are the first steps to incorporate it into my workflow?
Angus: I'm also relatively new to Git and have been using GitHub Desktop (a GUI for Windows and Mac). I'm not averse to command line stuff but I need to remember fewer arcane commands!
Samik: I use TortoiseGit \u2014 a Windows Shell Interface to Git.
Gray: I resonate with Michael, I do most of my research on my own and describe it in papers. It isn't particularly Git-friendly, I'm keen to learn.
Lauren: everything that everyone has said so far! I've found some good guidelines for how to write reproducible code, starting from the basics all the way to niche topics. Can we use this as a way to share materials that we've found useful? The British Ecological Society have published guidelines. We could assemble good materials that start from basics.
David: The Society for Open, Reliable, and Transparent Ecology and Evolutionary Biology (SORTEE) also have good materials.
Gerry: I like the idea of reproducibility and I've done a terrible job of it in the past, my repository ends up with thousands of versions of different files. Can you help me solve it?
Josh: Along the same lines of what's been stated. How best to share knowledge of Git and best practices with others in a new research team? How to adjust to their methods of conducting reproducible research, version control, etc?
Punya: not much to add, would really like to know more about version control, I have a basic understanding, what's the standard way of using it, reproducibility and documentation.
Rachel: I strongly support this idea of code reproducibility. Best practice frameworks can be disseminated to modellers in modelling consortia, and they can be very helpful when auditing.
Ella: we're migrating models from Excel to R.
J'Belle: I work for a tiny, very remote health service at the Australian and Papua New Guinea border. We have 17 sources of clinical data, which presents massive challenges in reproducibility and quality assurance. I'm looking to tap into your expertise. How do we manage so many sources of clinical data?
How can we make best use of existing tools and practices, while working with collaborators who have less technical expertise/experience?
Alex: if I start a project with collaborators who may be less technically literate, how can they contribute without messing up reproducibility? Options like Docker are a little too complicated. How can I motivate people, is there a simple solution?
Angus: in theory you may have reproducible code. But if you need to write a paper with less technical collaborators, running the code and generating reports can be hard. How do we collaborate on the writing side? RMarkdown and equivalents makes a lot of sense, but most colleagues will only look at Word documents. There are some workarounds, such as pandoc.
How far can/should we go in validating and demonstrating that our models and analyses are reproducible? How can we automate this? How is this impacted when we cannot share the input data, or when our models are extremely large and complex?
Cam: there are unique issues in the type of research we do. Working with code makes it easy in some ways, as opposed to experimental conditions in real-world experiments. Our capacity for reproducibility is great, but so then is our burden. We should be exploring the limitations! Some challenges in our area come down to implementation of stochastic models with lots of random processes. How can we do that well and make it part of what we practice? What are the limitations to reproducibility, and how do we perceive the goals when we are working with data that cannot be shared?
Samik: similar to Cam, I'm interested in how people have produced reproducible research where the data cannot be shared. Perhaps we can provide default examples as test cases?
Michael: I second Cam's points, particularly about reproducibility with confidential data. That's an issue I've hit multiple times. We usually have a side folder with the real dataset, and pipe through condensed or aggregated versions of the data that we can release.
Jiahao: I'm interested in how to build a platform for using agent based models. I've looked at lots of other models, but how can we bring them together so that it is easier to add a new variable or extend a model?
Eamon: I'm a Git fanatic, and I want to understand the development of code that I work with. I get annoyed when people share a repository as a single commit. People who don't use tags in their Git repositories to identify the version of the code they used in, e.g., a paper! How do you start running the code? What file formats does it expect to process?
Dion: I'm interested in seeing what people are doing that look like good practice. Making sure that code and results are reproducible, in the sense that your code may be version controlled, but you've since made changes to code, parameters, input data, etc. How do you do a good job to shoe-horn that all into Git? Maybe use Git for development and simultaneously use a separate repository for production stuff? We need to be able to look back and identify from the commit the code, data, intermediate files used along the way.
Palang: I've looked at papers with supplementary code, but running the code myself produces very different results from what's in the paper.
May: most people have said what I wanted to say. I faced similar problems with running other people's code. It may not print any error message, but you get very different results from what's published in the paper. You don't know who or what is wrong!
How can we develop confidence in our own code, and in other people's code?
TK: I want to learn/see different conventions for writing code documentation. I've never managed to get doxygen working to my satisfaction.
Angus: how do we design good tests? How to test, when to test, what to test for? Should we have coverage targets? Are there ways to automate testing?
Rahmat: I often find it very hard to learn how to use other people's code. The code needs to be easy to understand. Otherwise, I will just write the code myself! Sometimes when I run the code, I have difficulties in generating results, many errors come up and it's not clear why. Perhaps all of the necessary data have not been shared with the code? We need to include the data, but if the data cannot be provided, you need to provide similar data so that others can run the code. It also helps to use a language that others are familiar with.
Pan: I am not sure about the term reproducibility in the context of coders. I know lab people really do reuse published protocols. But do coders actually reuse other people's code to do their work?
Gerry: People often make their code into packages which others reuse. This could be a good topic for future meetings.
Pan: I recently joined a meeting where people have used Chat GPT to check their code. Does this group have any thoughts on how we might make good use of Chat GPT?
Cam: Chat GPT is not reproducible itself, so it seems questionable to use it to check reproducibility.
Alex: I don't entirely agree, it can be very useful for improving the implementation of a function. In terms of generating reliable code, it's wonderful. It's a nightmare for evaluating existing code.
Pan: people are using Chat GPT to generate initial templates.
Eamon: If you encounter code that has poor documentation, Chat GPT is surprisingly good at telling you how to use it.
Matt: I don't have anything to add to the above, I'm happy to be along for the ride.
In this meeting we asked participants to share their experiences exploring the version control, reproducibility, and testing exercises in our example repository.
This repository serves as an introduction to testing models and ensuring that their outputs are reproducible. It contains a simple stochastic model that draws samples from a normal distribution, and some tests that check whether the model outputs are consistent with our expectations.
"},{"location":"community/meetings/2023-06-13/#what-is-a-reproducible-environment","title":"What is a reproducible environment?","text":"The exercise description was deliberately very open, but it may have been too vague:
Define a reproducible environment in which the model can run.
We avoided listing possible details for people to consider, such as software and package versions. Perhaps a better approach would have been to ask:
If this code was provided as supporting materials for a paper, what other information would you need in order to run it and be confident of obtaining the same results as the original authors?
The purpose of a reproducible environment is to define all of these details, so that you never have to say to someone \"well, it runs fine on my machine\".
"},{"location":"community/meetings/2023-06-13/#reproducibility-and-stochasticity","title":"Reproducibility and stochasticity","text":"Many participants observed that the model was not reproducible unless we used a random number generator (RNG) with a known seed, which would ensure that the model produces the same output each time we run it.
But what if you're using a package or library that internally uses its own RNG and/or seed? This may not be something you can fix, but you should be able to detect it by running the model multiple times with the same seed, and checking whether you get identical results each time.
Another important question was raised: do you, or should you, include the RNG seed in your published code? This is probably a good idea, and suggested solutions included setting the seed at the very start of your code (so that it's immediately visible) or including it as a required model parameter.
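As a small illustration (a Python sketch, not taken from the example repository), the seed can be a required argument of the model function, so it is always visible and recorded alongside the results:

```py
import numpy as np

def run_model(n_samples, seed):
    # The seed is a required parameter: the model builds its own
    # generator from it, rather than relying on global state.
    rng = np.random.default_rng(seed)
    return rng.normal(size=n_samples)

# Re-running with the same seed must produce identical output.
assert np.array_equal(run_model(100, seed=20230613), run_model(100, seed=20230613))
```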
"},{"location":"community/meetings/2023-06-13/#writing-test-cases","title":"Writing test cases","text":"Tip
Write a test case every time you find a bug: ensure that the test case finds the bug, then fix the bug, then ensure that the test case passes.
A test case is a piece of code that checks that something behaves as expected. This can be as simple as checking that a mathematical function returns an expected value, to running many model simulations and verifying that a summary statistic falls within an expected range.
Rather than trying to write a single test that checks many different properties of a piece of code, it can be much simpler and quicker to write many separate tests that each check a single property. This can provide more detailed feedback when one or more test cases fail.
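As an illustrative sketch (the run_model function below is hypothetical, not part of the example repository), separate pytest-style tests can each check a single property:

```py
import numpy as np

def run_model(n_samples, seed):
    # Hypothetical model: draw samples from a standard normal distribution.
    return np.random.default_rng(seed).normal(size=n_samples)

def test_output_length():
    # One property per test: the output has the requested length.
    assert len(run_model(50, seed=1)) == 50

def test_sample_mean_is_plausible():
    # One property per test: a summary statistic falls within an expected range.
    samples = run_model(10_000, seed=1)
    assert abs(samples.mean()) < 0.1
```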
Note
This approach is similar to how we rely on multiple public health interventions to protect against disease outbreaks! Consider each test case as a slice of Swiss cheese \u2014 many imperfect tests can provide a high degree of confidence in our code.
"},{"location":"community/meetings/2023-06-13/#writing-test-cases-for-conditions-that-may-fail","title":"Writing test cases for conditions that may fail","text":"If you are testing a stochastic model, you may find certain test cases are difficult to write.
For example, consider a stochastic SIR model where you want to test that an intervention reduces the number of cases in an outbreak. You may, however, observe that in a small proportion of simulations the intervention has no effect (or it may even increase the number of cases).
One approach is to run many pairs of simulations and only check that the intervention reduced the number of cases at least X% of the time. You need to decide how many simulations to run, and what is an appropriate value for X%, but that's okay! Remember the Swiss cheese analogy, mentioned above.
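A hedged sketch of this idea in Python, using a made-up stand-in for the stochastic model and an arbitrary 90% threshold:

```py
import numpy as np

def final_size(rng, intervention):
    # Stand-in for a stochastic SIR simulation; replace with your own model.
    baseline = rng.poisson(100)
    return rng.binomial(baseline, 0.6) if intervention else baseline

def test_intervention_usually_reduces_cases():
    rng = np.random.default_rng(20230613)
    n_pairs = 200
    reduced = sum(
        final_size(rng, intervention=True) < final_size(rng, intervention=False)
        for _ in range(n_pairs)
    )
    # Only require a reduction in at least 90% of the paired simulations.
    assert reduced / n_pairs >= 0.9
```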
"},{"location":"community/meetings/2023-06-13/#testing-frameworks","title":"Testing frameworks","text":"If you have more than 2 or 3 test cases, it's a good idea to use a testing framework to automatically find your test cases, run each test, record whether it passed or failed, and report the results. These frameworks are usually specific to a single programming language.
Some commonly-used frameworks include:
Multiple participants reported some difficulties in setting up GitHub actions and knowing how to adapt available templates to their needs. See the following examples:
We will aim to provide a GitHub action workflow for each model, and add comments to explain how to adapt these templates.
Warning
One downside of using GitHub Actions is the limited computation time of 2,000 minutes per month. This may not be suitable for large agent-based models and other long-running tasks.
"},{"location":"community/meetings/2023-06-13/#pull-requests","title":"Pull requests","text":"At the time of writing, three participants have contributed pull requests:
TK added a default seed so that the model outputs are reproducible.
Michael added a MATLAB version of the model and the test cases.
Cam added several features, such as recording metadata about the Python environment and testing that the model outputs are reproducible.
Tip
If you make your own copy (\"fork\") of the example repository, you can create as many commits as you want. GitHub will display a message that says:
This branch is N commits ahead of rrcop:master.
Click on the \"N commits ahead\" link to see a summary of your new commits. You can then click the big green button \"Create pull request\".
This will not modify the example repository. Instead, it will create an overview of the changes between your code and the example repository. We can then review these changes, make suggestions, you can add new commits, etc, before deciding whether to add these changes to the example repository.
"},{"location":"community/meetings/2023-08-15/","title":"15 August 2023","text":"Info
See the Resources section for links to useful resources that were mentioned in this meeting.
"},{"location":"community/meetings/2023-08-15/#changes-to-practices","title":"Changes to practices","text":"In this meeting we asked everyone what changes (if any) they have made to their research and reproducibility practices since our last meeting.
A common theme was improving how we note and record our past actions. For example:
Eamon has begun recording the commit ID (\"hash\") of the code that was used to generate each set of outputs. This allows him to easily retrieve the exact version of the code that was used to generate any past result and, e.g., generate other outputs of interest.
Pan talked about how their group records raw data separately from, but grouped with, the analysis code and processed data that were generated from these raw data. They also record every step of their model-fitting process, which may not always go as smoothly as expected.
This ensures that stakeholders who want to use these models to run their own scenarios can reproduce the baseline scenarios without being modelling experts themselves.
The model is available as an online app.
Gizem asked the group \"How do you choose an appropriate project structure, especially if the project changes over time?\"
Phrutsamon: the TIER Protocol 4.0 provides a template for organising the contents and reproduction documentation for projects that involve working with statistical data.
Rob: there may not be a single perfect solution that addresses everyone's needs. But look back at past projects, and try to imagine how the current project might change in the future. And if you're using version control, don't be afraid to experiment with different project structures \u2014 you can always revert back to an earlier commit.
"},{"location":"community/meetings/2023-08-15/#reviewing-code-as-part-of-manuscript-peer-review","title":"Reviewing code as part of (manuscript) peer review","text":"Rob asked the group \"Has anyone reviewed supporting code when reviewing a manuscript?\"
Ruarai read through R code that was provided with a paper, but was unable to run all of it \u2014 some of the code produced errors.
Similarly, Rob has read R code provided with a paper that used hard-coded paths that did not exist (e.g., \"C:\\Users\\<Author Name>\\...\"
), tried to run code in source files that did not exist, and read data from CSV files that did not exist.
Info
Pan mentioned a fantastic exercise for research students.
Pick a modelling paper that is relevant to their research project, and ask the student to:
This teaches the students that reproducibility is very important, and shows them what they need to do when they publish their own results.
It's important to pick a relatively simple paper, so that this task isn't too complicated for the student. And if the paper is written by a colleague or collaborator, you can contact them to ask for extra details, etc.
"},{"location":"community/meetings/2023-08-15/#using-shiny-to-make-models-availablereproducible","title":"Using Shiny to make models available/reproducible","text":"Pan asked the group \"What do you think about (the extra work involved in) turning R code into Shiny applications, to show that the model is reproducible, and do so in a way that lets others easily make use it?\"
An objective of the COVID-19 International Modelling Consortium (CoMo) is to make models available and usable for non-modellers \u2014 turning models into something that anyone with minimal knowledge can explore.
The model is available as a Shiny app, and is continually being updated and refined. It is currently at version 19! Pan's group is trying to ensure that existing users update to the most recent version, because it can be very challenging and time-consuming to create scenario templates for older model versions. Templates are a good way to help the user define their scenario-specific settings, but it's a nightmare when you change the model version \u2014 it's like working with a new model.
Eamon: this is similar to when software platforms make changes to their APIs. Can you make backwards-compatible changes, or automatically transform old templates to make them compatible with the latest model version? This kind of work is simple to fund when your software is a commercial product, but it's much harder to find funding for academic outputs.
Pan: It's a lot of extra work, without any money to support it. For this consortium we hired several programmers, some for the coding, some specifically for the Shiny app, it involved a lot of resources. That project has now ended, but we've learned a lot and have a good network of collaborators. We still have monthly meetings! This was a special case with COVID-19, because the context changed so quickly. It would be much less of a problem with other diseases, which we better understood.
Gizem: very much in favour of using Shiny to make models available, and recently made a Shiny app for their latest project (currently under review). Because the model is very complicated, we had to pre-calculate model results for specific parameter combinations, and only allow users to choose between these parameter combinations. One reviewer asked for a modified figure to show results for slightly different parameter values, and it was quite simple to address.
Hadley Wickham has written a very good book about developing R Shiny applications. Gizem read a chapter of this book each morning, but found it necessary to practice in order to really understand how to use Shiny.
Info
Learning by doing (experiential learning) is a highly-effective way of convincing people to change their practices. It can be greatly enhanced by engaging as a community.
"},{"location":"community/meetings/2023-08-15/#resources","title":"Resources","text":""},{"location":"community/meetings/2023-08-15/#teaching-reproducibility-and-responsible-workflows","title":"Teaching reproducibility and responsible workflows","text":"The Journal of Statistics and Data Science Education published a special issue: Teaching Reproducibility in November 2022. The accompanying editorial article highlights:
Integrating reproducibility into our practice and our teaching can seem intimidating initially. One way forward is to start small. Make one small change to add an element of exposing students to reproducibility in one class, then make another the next semester. Our students can get much of the benefit of reproducible and responsible workflows even if we just make a few small changes in our teaching. These efforts will help them to make more trustworthy insights from data. If it leads, by way of some virtuous cycle, to us improving our own practice, then even better! Improving our teaching through providing curricular guidance about reproducible science will take time and effort that should pay off in the long term.
This journal issue was followed by an invited paper session with the following presentations:
Collaborative writing workflows: building blocks towards reproducibility
Opinionated practices for teaching reproducibility: motivation, guided instruction, and practice
From teaching to practice: Insights from the Toronto Reproducibility Conferences
Teaching reproducibility and responsible workflow: an editor's perspective
Documentation that meets the specifications of the TIER Protocol contains all the data, scripts, and supporting information necessary to enable you, your instructor, or an interested third party to reproduce all the computations necessary to generate the results you present in the report you write about your project.
"},{"location":"community/meetings/2023-08-15/#using-shiny","title":"Using Shiny","text":"Mastering Shiny: an online book that teaches how to create web applications with R and Shiny.
CoMo Consortium App: the COVID-19 International Modelling Consortium (CoMo) has developed a web application for an age-structured, compartmental SEIRS model.
Building reproducible analytical pipelines with R: this article shows how to use GitHub Actions to run R code when you push new commits to a GitHub repository.
GitHub Actions for the R language: this repository provides a variety of GitHub actions for R projects, such as installing specific versions of R and R packages.
See the GitHub actions for Git is my lab book, available here. For example, the build action performs the following steps:
Check out the repository, using actions/checkout
;
Install mdBook and other required tools, using make; and
Build an HTML version of the book, using mdBook
.
In this meeting we asked participants to share their experiences about good (and bad) ways to structure a project.
Info
We are currently drafting Project structure and Writing code guidelines.
See the pull request for further details. Please contribute suggestions!
We had six in-person and eight online attendees. Everyone predominantly uses one or more of the following languages:
The tidyverse style guide includes recommendations for naming files. One interesting recommendation in this guide is:
If files should be run in a particular order, prefix each file name with a number. For example:
00_download.R\n01_clean.R\n02_summarise.R\n...\n09_plot_figures.R\n10_generate_tables.R\n
A common starting point is often one or more scripts in the root directory. But we can usually divide a project into several distinct steps or stages, and store the files necessary for each stage in a separate sub-directory.
Tip
Your project structure may change as the project develops. That's okay!
You might, e.g., realise that some files should be moved to a new, or different, sub-directory.
Packaging: Python and R allow you to bundle multiple code files into a \"package\". This makes it easier to use code that is split up into multiple files. It also makes it simpler to test and verify whether your code can be run on a different computer. To create a package, you need to provide some metadata, including a list of dependencies (packages or libraries that your code needs in order to run). When installing a Python or R package, it will automatically install the necessary dependencies too. You test this out on, e.g., a virtual machine to verify that you've correctly listed all of the necessary dependencies.
Version control: the history may be extremely useful for you, but may contain things you don't want to make publicly available. One solution would be to know from the very start what files you will want to make available and what files you do not (e.g., sensitive data files), but this is not always possible. Another, more realistic, solution is to create a new repository, copy over all of the files that you want to make available, and record these files in a single commit. The public repository will not share the history of your project repository, and that's okay \u2014 the public repository's primary purpose is to serve as a snapshot, rather than a complete and unedited history.
"},{"location":"community/meetings/2023-10-18/#locating-files","title":"Locating files","text":"A common concern how to locate files in different sub-directories (e.g., loading code, reading data files, writing output files) without relying on using absolute paths. For loading code, Python and Matlab allow the user to add directories to the search path (e.g., by modifying sys.path
in Python, or calling addpath()
in Matlab). But these are not ideal solutions.
As a general rule, prefer using relative paths instead of absolute paths.
Relative paths are defined relative to the current working directory. For example: sub-directory/file-name
and ../other-directory
.
Absolute paths are defined relative to the root drive or directory. For example: /Users/My Name/...
and C:\\Users\\My Name\\...
.
Absolute paths may not exist on other people's computers.
project/input-data/file-1.csv
and a script file in project/analysis-code/read-input-data.R
, you can locate the data file from within the script with the following code:library(here)\ndata_file <- here(\"input-data/file-1.csv\")\n
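A comparable approach in Python (a sketch, not taken from the meeting notes) is to resolve paths relative to the script's own location with pathlib, so the result does not depend on the current working directory:

```py
from pathlib import Path

# __file__ is the location of this script (e.g., project/analysis-code/);
# step up to the project root and locate the data file from there.
project_root = Path(__file__).resolve().parent.parent
data_file = project_root / 'input-data' / 'file-1.csv'
```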
Tip
A general solution for any programming language is to break your code into functions, each of which accepts input and/or output file names as arguments (when required). This means that most of your code is entirely independent of your chosen project structure. You can then store/generate all of the file paths in a single file, or in each of your top-level scripts.
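For instance, a minimal Python sketch (the file names here are hypothetical):

```py
from pathlib import Path
import csv

def summarise_cases(input_file, output_file):
    # The function receives its paths as arguments, so it does not need to
    # know anything about the project structure.
    with open(input_file, newline='') as f:
        rows = list(csv.DictReader(f))
    with open(output_file, 'w') as f:
        f.write(f'n_rows,{len(rows)}\n')

# All paths are defined in one place, e.g. a top-level script.
data_dir = Path('input-data')
summarise_cases(data_dir / 'file-1.csv', Path('outputs') / 'summary.csv')
```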
"},{"location":"community/meetings/2023-10-18/#peer-review-get-feedback-on-project-structure","title":"Peer review: get feedback on project structure","text":"It can be helpful to get feedback from someone who isn't directly involved in the project. They may view the work from a fresh perspective, and be able to identify aspects that are confusing or unclear.
When inviting someone to review your work, you should identify specific questions or tasks that you would like the reviewer to address.
With respect to project structure, you may want to ask the reviewer to address questions such as:
README.md
file help you to navigate the project? You could also ask the reviewer to look at a specific script or code file, and ask questions such as:
Info
For further ideas about useful peer review activities, and how to incorporate them into your workflow, see the following paper:
Implementing code review in the scientific workflow: Insights from ecology and evolutionary biology, Ivimey-Cook et al., Journal of Evolutionary Biology 36(10):1347\u20131356, 2023.
"},{"location":"community/meetings/2023-10-18/#styling-and-formatting","title":"Styling and formatting","text":"We also discussed opinions about how to name functions, variables, files, etc.
For example, R allows you to use periods (.
) in function and variable names, but the tidyverse style guide recommends only using lowercase letters, numbers, and underscores (_
).
If you review other people's code, and have other people review your code, you might be surprised by the different styles and conventions that people use. When reviewing code, these differences can be somewhat distracting.
Agreeing on, and adhering to, a common style guide can avoid these issues and allow the reviewer to dedicate their attention to actually reading and reasoning about the code.
There are tools to automatically format your code (\"formatters\") and to warn about potential issues, such as unused variables (\"linters\"). Here are some commonly-used formatters and linters for different languages:
| Language | Style guide(s) | Formatter | Linter |
| --- | --- | --- | --- |
| R | tidyverse | styler | lintr |
| Python | PEP 8 / The Hitchhiker's Style Guide | black | ruff |
| Julia | style guide | JuliaFormatter.jl | Lint.jl |

There are AI tools that you can use to write, format, and review code, but you will need to check whether the code is correct. For example, GitHub Copilot is a (commercial) tool that accepts natural-language descriptions and generates computer code.
Tip
Feel free to use AI tools as a way to get started, but don't simply copy-and-paste the code they give you without reviewing it.
"},{"location":"community/meetings/2024-02-19/","title":"19 February 2024","text":"In this meeting we asked participants to suggest goals and activities to achieve in 2024.
Note
If you were unable to attend the meeting, you can send us suggestions via email.
We have identified the following goals for 2024:
Develop orientation materials for new students and staff;
Share examples of project repositories and model implementations;
Build expertise in testing your own code;
Use peer code review to share knowledge and develop coding skills;
See the sections below for further details.
"},{"location":"community/meetings/2024-02-19/#orientation-materials","title":"Orientation materials","text":"The first suggestion was to develop orientation materials for new students, postdocs, people coming from wet-lab backgrounds, etc. Suggested topics included:
Info
Some of these topics are already covered in Git is my lab book; see the links above.
There was broad interest in having a checklist, and example workflows for people to follow \u2014 particularly for projects that involve some form of code \"hand over\", to ensure that the recipients experience few problems in running the code themselves.
We aim to address these topics in Git is my lab book, with assistance and feedback from the community. See the How to contribute page for details.
"},{"location":"community/meetings/2024-02-19/#example-projects-and-model-templates","title":"Example projects and model templates","text":"Building on the idea of orientation materials, a number of participants suggested providing example projects and different types of models.
The most commonly used languages in our community are:
As an example, we could demonstrate how to write an age-stratified SEIR ODE model in R and Python, and how to write agent-based models in vector and object-oriented forms.
Info
GitHub allows you to create template repositories, which might be a useful way of sharing such examples. We could host these template repositories in our Community of Practice GitHub organisation.
"},{"location":"community/meetings/2024-02-19/#how-and-why-to-begin-testing-your-code","title":"How (and why!) to begin testing your code","text":"We asked everyone whether they'd ever found bugs in their code, and were relieved to see that yes, all of us have made mistakes! Writing test cases in one way to check that your code is behaving in the way that you expect.
But it can be hard to figure out how to actually write useful tests.
You can make your code easier to test if you structure your code well, and consider how to test it when you start coding.
As an example, Cam mentioned that he had written a stochastic model of a hospital ward, in which there was a queue of tasks. At the end of a shift, some tasks may not have been done, and these are put back on the queue for the next shift. Cam discovered there was a bug in the way this was done, and fixed it. However, later on he reintroduced the same bug. This is precisely the situation where regression tests are useful. In brief:
When you discover a bug in your code, write a test that detects this bug before you fix it.
You now have a test that will identify this bug if you ever make the same mistake again.
But you need to run this test whenever you modify your code.
Continuous Integration (CI) is one way to run tests automatically, whenever you push a commit to a platform such as GitHub or GitLab. See the list of resources shared in our previous meeting for some examples of using CI.
In our community we have a number of people with familiarity and expertise in testing infectious disease models and related code. We need to share this knowledge and help others in the community to learn how to test their code.
"},{"location":"community/meetings/2024-02-19/#peer-code-review","title":"Peer code review","text":"We talked about how we improve our writing skills by receiving feedback on drafts from colleagues, supervisors, etc, and how a similar approach can be extremely useful for improving our coding skills.
Info
A goal for this year is to review each other's code! Note that we have developed some guidelines for peer code review.
Suggestions included:
Before submitting a paper to a journal, ask someone else to run your code.
Rob has been working on a within-host malaria model, which is written in R and uses Continuous Integration to generate R Markdown documents every time the code is updated. He is happy to share this for code review, so that:
Everyone can see a working example of continuous integration; and
He can receive feedback on the code, since he is not an experienced R programmer.
A number of participants expressed a willingness to participate in peer code review.
Angus noted that it can be difficult to identify a discrete chunk of digestible code to offer up for peer review.
We can coordinate peer code review through our Community of Practice GitHub organisation.
"},{"location":"community/meetings/2024-02-19/#sharing-code-but-not-the-original-data","title":"Sharing code but not the original data","text":"Samik mentioned that in a recent paper, The impact of Covid-19 vaccination in Aotearoa New Zealand: A modelling study, the code is provided in a public GitHub repository, but that they do not have permission to share the original health data.
Info
We can frequently encounter this issue when working with public health and clinical data sets.
What are the best practices in this case?
One option is to generate dummy data by, e.g., resampling or adding noise to the original data. You could then inform the reader that they should obtain similar, but not necessarily identical results.
You can also use a checksum algorithm (such as SHA-2) and include the checksum for each original data file in the public repository. This would allow users who can obtain access to the original data to verify that they are using identical data.
In this meeting we asked participants to suggest specific tools, templates, packages, etc, that we should include in our Orientation guide. We used the high-level topics proposed in our previous meeting as a starting point.
Attendance: 7 in person, 9 online.
Git is my lab book updates
We have switched to Material for MkDocs, which greatly improves the user experience.
For example, use the search bar above (press F) to interactively search the entire website.
"},{"location":"community/meetings/2024-04-11/#purpose-of-the-guide","title":"Purpose of the guide","text":"We are aiming to keep the orientation guide short and simple, to avoid overwhelming newcomers.
James Ong: If we can agree on a structure, we can then get people to contribute to specific sections.
TK Le: schedule a one-hour meeting where everyone works on writing content for 30 minutes, and then reviews each others' content for 30 minutes?
"},{"location":"community/meetings/2024-04-11/#project-organisation","title":"Project organisation","text":"Key message
A project's structure may need to change over time, and that's fine. What matters is that the structure is explained.
A common theme raised by a number of participants was deciding how to organise your files, dividing projects into cohesive parts, and explaining the relationships between these parts (i.e., how they interact or come together).
Useful tools mentioned in this conversation included:
Using git to track your files;
Using tools such as the here package for R to locate your files without resorting to hard-coded paths or changing the working directory;
Defining your project's workflow with a pipeline tool such as the targets package for R.
Info
We are drafting topical guides that cover these topics. See the online previews for the following guides:
If you have any suggestions or feedback, please let us know in the pull request!
"},{"location":"community/meetings/2024-04-11/#working-collaboratively","title":"Working collaboratively","text":"Key message
Plan to work collaboratively from the outset. It is highly likely that someone else will end up using your code.
Nick Tierney: you are always collaborating with your future self!
One concern raised was how best to prepare your code for handover.
Pan: You need to think about it from the beginning. There will be more and more people trying to use existing models. I am writing a guideline about vaccination modelling, and referring to readers as the \"model user\" (developers, modellers, end users). If we plan for others to use our model, we need to develop the model in a way that aims to make it easier for people to use.
Reminder
We have developed a topical guide on how to use git for collaborative research.
"},{"location":"community/meetings/2024-04-11/#reviewing-code-and-project-structure","title":"Reviewing code and project structure","text":"Key message
Feedback from other people can be extremely useful to identify sources of confusion.
The earlier that someone can review your code and project structure, the easier it should be to act on their feedback.
Saras Windecker mentioned that the Infectious Disease Ecology and Modelling (IDEM) team organised code review sessions that the team found really informative, but also reminded everyone how hard it is to have guidelines that are consistent and broadly useful.
Question: was the purpose of these sessions to review code, or to review the project structure?
They were intended to review code, but team members found they had to review the project structure before they could begin to understand and improve the code.
Question: What materials, inputs, resources, etc, can we provide people who are dealing with messy code?
Rob Moss reflected on his experience of picking up a within-host malaria RBC spleen model and how difficult it was to identify which parts of the model were complete and which needed further work. He gradually divided the code into separate files, and regularly ran model simulations to verify that the outputs were consistent with the original code.
Info
Rob is happy to share the original model code, discuss how it was difficult to understand, and to walk through how he restructured and documented it. If you're interested, send him an email.
"},{"location":"community/meetings/2024-04-11/#how-to-structure-your-data","title":"How to structure your data","text":"Key message
If data are shared, they often lack the documentation to make them easy to reuse.
Nick Tierney asked whether anyone had thoughts on how to structure their data. Consistent with our earlier discussion, he suggested that one of the most important things is to have a README that explains the project structure. He then shared one of his recent papers, Common-sense approaches to sharing tabular data alongside publication.
Question: do you advocate for data to be tidied (\"long\"), etc?
Key message
There are various ways to manage confidential data, each with pros and cons.
Michael Plank asked for ideas about how to manage confidential data when working with collaborators, to ensure that everyone is using the most recent version of the data. Obviously you don't want to commit the data files in your project repository, so the data files should be listed in your .gitignore file.
One option is to store the confidential data in a separate git repository, with tightly controlled access permissions. You can keep a local copy in a separate directory. If you create a symbolic link to this directory inside your project repository, and add this symbolic link to your .gitignore file, you can still use tools such as here to access the data.
A second option, which was used by a number of teams that worked on COVID-19 projects for the Australian Department of Health, was to host the data on a secure data management platform (mediaflux). Every time new data were received, the data management groups would perform various quality checks and generate analysis-ready data files. They would then notify all of the teams, who would each download the latest data files as part of their computational pipeline.
The most suitable solution probably depends on a combination of factors, including:
Key message
Debugging is an important skill, but good coding practices are important for making your code easier to test and debug.
A number of people suggested that the orientation guide should provide some information about how to debug your code.
Nick Tierney: I could go on a long rant about debugging, and why we should be teaching how to divide code into small functions that are easier to test!
We also discussed that there are various ways to debug your code, from printing debugging messages, to using an interactive debugger, to writing test cases that explain how the code should work.
Rob Moss: I've used regression tests and continuous integration (CI) to detect when I accidentally change the outputs of my code/model. For example, my SMC package for Python includes tests cases that record simulation outputs, and after running the tests I check that the output files remain unchanged.
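A hedged sketch of this kind of regression test (the model function and file contents below are stand-ins, not taken from the package) compares a fresh model run against a stored copy of a previously accepted output:

```python
# A minimal output-regression check, runnable with pytest: re-run the model
# and verify that its output matches a stored "known good" reference file.
from pathlib import Path


def run_simulation(output_file):
    # Stand-in for the real model entry point; it writes deterministic output.
    Path(output_file).write_text("day,infections\n0,2\n1,3\n2,5\n")


def test_simulation_output_is_unchanged(tmp_path):
    # In a real project the reference file would live in the repository.
    reference = tmp_path / "expected.csv"
    reference.write_text("day,infections\n0,2\n1,3\n2,5\n")
    output_file = tmp_path / "simulation.csv"
    run_simulation(output_file)
    assert output_file.read_text() == reference.read_text()
```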
"},{"location":"community/meetings/2024-04-11/#guidelines-for-using-ai","title":"Guidelines for using AI","text":"Key message
Practices such as code review and testing are even more important for code that is created using generative AI.
Pan: speaking for junior students, a lot of students are now using ChatGPT for their coding, either to create the initial structure or to transform code from one programming language to another language.
Question: Can we provide useful guidelines for those students?
James: this probably isn't something we will cover in the orientation guide. But perhaps we need some guidelines for generative AI use in the topical guides.
Testing your code and ensuring it is reproducible is even more important when using ChatGPT. We know it doesn't always give you correct code, so how can you decide whether what it's given you is useful? It would be great to have an example of code generated by ChatGPT that is incorrect or unnecessarily difficult to understand, and to show how we can improve that code.
A question for the community
Does anyone have any examples of code produced by ChatGPT that didn't function as intended?
"},{"location":"community/meetings/2024-04-11/#useful-resources","title":"Useful resources","text":"The following resources were mentioned in this meeting:
Developing a modern data workflow for regularly updated data;
A computational pipeline for the paper Mapping trends in insecticide resistance phenotypes in African malaria vectors;
Common-sense approaches to sharing tabular data alongside publication;
The here package for R; and
The targets package for R.
In this meeting TK Le gave a presentation about a series of COVID-19 modelling projects and how their experiences were impacted by the choice of programming languages, model structure, editors, tools, etc.
Attendance: 4 in person, 4 online.
Info
We welcome presentations about research projects and experiences that relate in any way to reproducibility and good computational research practices. Presentations can be short, informal, and free-form.
Please contact us if you have anything you might like to present!
"},{"location":"community/meetings/2024-05-09/#three-projects-four-models","title":"Three projects, four models","text":"This work began in 2022, and was based on code that was originally written by Eamon and Camelia. TK adapted the code to run new scenarios across three separate modelling projects:
The workflow was divided into a series of distinct models:
An immunological model written in R and greta;
An agent-based model of population transmission written in C++ (post-processing written in R);
A clinical pathways model implemented in MATLAB; and
A cost-effectiveness model implemented in R.
TK's primary activities involved implementing different vaccine schedules in the transmission model, and data visualisation of outputs from the transmission model and the clinical pathways model, all of which they implemented in Python.
"},{"location":"community/meetings/2024-05-09/#the-multi-model-structure","title":"The multi-model structure","text":"Key message
There isn't necessarily a single best way to structure a large project.
Question: Was it a benefit to have separate model implementations in different languages, with clearly defined data flows from one model to the next?
Conceptually yes, but this structure also involved a lot of trade-offs \u2014 not least the sheer volume of data that had to be saved by one model and read into another. It was difficult to pick up and use the code as a whole. And there were related issues, such as the difficulty of successfully installing greta.
Nick Tierney: I know that greta can be difficult to install, and I can help people with this.
TK also noted that they found some minor inconsistencies between the models, such as whether a date was used to identify the start or end of its 24-hour interval.
"},{"location":"community/meetings/2024-05-09/#tools-and-platforms","title":"Tools and platforms","text":"Key message
Personal preferences can play a major role in deciding which tools are best for a project.
The various models were hosted in repositories on Bitbucket, GitHub, and the University of Melbourne's GitLab instance. TK noted that the only discernible difference between these platforms was how they handled authorisation and authentication.
TK also explored several different editors, some of which were language-specific:
TK noted they had previous experience with Eclipse (an IDE for Java) and Visual Studio (which felt very \"heavy\").
Question: what were your favourite things in VS Code, and what made RStudio the worst?
It was easiest to open a project in VS Code; RStudio would always open an entire project or previous workspace, rather than just opening a single file. RStudio also kept asking to update itself.
TK also strongly disliked the RStudio font, which was another source of friction. They tried installing an RStudio extension for VS Code, but weren't sure how well it worked.
Nick Tierney: R history is really annoying, but you can turn it off. I'm not sure why it's enabled by default; all of the RStudio recommendations involve turning off these kinds of things.
Rahmat Sagara: I'm potentially interested in using VS Code instead of RStudio.
Eamon Conway: the worst thing about VS Code is that debugging is very hard to set up.
"},{"location":"community/meetings/2024-05-09/#task-management","title":"Task management","text":"Key message
Task management is hard, and switching to a new system during a project is extremely difficult.
TK reported trying to use GitLab issues to plan out what to do and how to do it, but found they weren't a good fit with their workflow. They then trialled Trello boards for different versions, but stopped using them due to a lack of time to manage the boards. In review:
Rob Moss: we know that behaviour changes are hard to achieve, so it's not surprising that a large change was challenging to maintain \u2014 ideally we would make small, incremental changes, but this isn't always possible or useful.
Eamon Conway: I like the general idea of using task-tracking software, but I've settled on only using paper. It's always with me, it's always at home, and it's physically under my keyboard!
Ruarai Tobin: I use Notion with a single large Markdown file; you can paste screenshots into it.
"},{"location":"community/meetings/2024-05-09/#repository-structure","title":"Repository structure","text":"Key message
There are many factors and competing concerns that influence the repository structure.
The repository structure changed a lot across these three projects.
In the beginning, the main challenge was separating out the different parts. While this was achieved, it wasn't immediately obvious where a user was supposed to start \u2014 the file structure did not make it clear. The README.md file did, however, include an explanation.
By the final project, the repository was divided into a number of directories, each of which was given a numeric prefix (e.g., 0_data and 4_post_processing). However, this was also a little misleading:
In order to run the code you start in the folder numbered 1;
The post processing in 4_post_processing was interleaved between some of the other steps (it mostly contains plotting and visualisation code, but also some other material);
The structure also had to differ between running the code on Monash University's MASSIVE HPC platform, and on the University of Melbourne's Spartan HPC platform.
Question: is there an automated pipeline?
TK replied that the user had to run the code in each of the numbered folders in the correct (ascending) order, and that they wanted to automate the dependent jobs on Spartan.
Eamon Conway: if you do ever automate it, we should share it with people (e.g., this community) because people may be able to learn from it when they want to use Spartan. I know how to use slurm for job management and can help you automate it.
"},{"location":"community/meetings/2024-05-09/#data-visualisation","title":"Data visualisation","text":"Key message
Producing good figures takes a lot of time, thought, and experimentation, and also a lot of coding.
It was extremely hard to decide what to show, and how to show it, in order to highlight key messages.
It was very easy to overwhelm the viewer with complicated plots and massive legends. For example, the scenarios involved three epidemic waves, and how can you show relationships between each pair of waves? It is relatively simple to build a 3D plot that shows these relationships, but the viewer can't really interpret the results.
"},{"location":"community/meetings/2024-05-09/#other-activities","title":"Other activities","text":"Key message
Following better practices would have required maybe 50% more time, but there wasn't funding \u2014 who will pay for this?
Dedicating time to other activities was not feasible \u2014 no one had time; these projects had fixed deadlines and it was challenging to complete the work within them.
As explained above, data visualisation took longer than expected. And sometimes the code simply wouldn't run on high-performance computing platforms. For example, sometimes MATLAB just wouldn't load; there were intermittent failures for no apparent reason and with no useful error messages.
Activities that would have been nice to do, but were not undertaken, included:
Rob Moss: we're very unlikely to get funding to explicitly cover these activities. If possible, we need to allocate sufficient time in our budgets, as best as possible. Practising these skills on smaller scales can also help us to use them with less overhead in larger projects.
"},{"location":"community/meetings/2024-05-09/#version-control-and-publishing","title":"Version control and publishing","text":"Key message
This can be challenging even with all of the tools and infrastructure in place.
Question: were all of the projects wrapped up into one central thing?
No, they're all separate. The first project was provided as a zip file attached to the paper. The second project is in a public git repository. The final project is ongoing and remains confidential; it is stored in a private repository on the University of Melbourne's GitLab instance.
Question: did the latest project build on the previous ones?
Yes, and this led to managing current and older versions of the code. For example, TK found a bug that caused a minor discrepancy between times reported in two different models (see The multi-model structure) but it wasn't possible to update the older code and regenerate the associated outputs.
Question: should we use git (and GitHub) only for publication, or during the work itself?
Eamon Conway: Use it from the beginning to track your work, and maybe have different privacy settings (confidential code and/or data).
Rob Moss: you can use a git repository for your own purposes during the work, and upload a snapshot to Figshare or Zenodo to obtain a DOI that you can cite in your paper.
"},{"location":"community/meetings/2024-05-09/#broader-conclusions","title":"Broader conclusions","text":"Changing our behaviour and work habits is hard, and so is finding time to develop these skills. We need to practice these skills on small problems first, rather than on large projects (and definitely not when there are tight deadlines).
A question for the community
Should we organise an event to practice and develop these skills on small-scale problems?
"},{"location":"community/meetings/2024-06-13/","title":"13 June 2024","text":""},{"location":"community/meetings/2024-06-13/#cam-zachreson-a-comparison-of-three-abms","title":"Cam Zachreson: A comparison of three ABMs","text":"In this meeting Cam gave a presentation about the relative merits and trade-offs of three different approached for agent-based models (ABMs).
Attendance: 7 in person, 13 online.
"},{"location":"community/meetings/2024-06-13/#theoretical-frameworks","title":"Theoretical frameworks","text":"Key message
Each framework is built upon different assumptions about space, contacts, and transmission.
Cam introduced three theoretical frameworks for disease transmission, which he used in constructing infectious disease models for three different projects. Note that all three models use the same within-host model for individual dynamics.
Border quarantine for COVID-19: international arrivals, quarantine workers, and the local community are divided into mixing groups within which there is close contact. There is also weaker contact between these mixing groups.
Social isolation in residential aged care facilities: the transmission model is a multigraph that explicitly simulates contact between individuals. The graph is dynamic: workers and worker-room assignments can change every day. Workers share N edges when they service N rooms in common.
A single hospital ward (work in progress): a shared space model represents spatial structure as a network of separate spaces (i.e., nodes). Nurses and patients are associated with spaces according to schedules. Each space has its own viral concentration, driven by shedding from infectious people and ventilation (the air changes around 6 times per hour). Residence in a space results in a net viral dose, which confers a probability of infection (using the Wells-Riley model).
Question
Are many short interactions equivalent to one long interaction?
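Under the Wells-Riley model, infection risk is usually written as P = 1 - exp(-n), where n is the number of infectious quanta inhaled. The sketch below (with made-up numbers, not Cam's implementation) shows the calculation, and illustrates that at a fixed concentration the total inhaled dose is what matters, so several short stays give the same risk as one long stay of the same total duration:

```python
# Wells-Riley dose-response: P(infection) = 1 - exp(-dose in quanta).
import math


def infection_probability(quanta_concentration, breathing_rate, duration):
    """Arguments use consistent units, e.g. quanta/m^3, m^3/hour, and hours."""
    dose = quanta_concentration * breathing_rate * duration
    return 1.0 - math.exp(-dose)


one_long_stay = infection_probability(0.05, 0.5, 2.0)
four_short_stays = 1 - (1 - infection_probability(0.05, 0.5, 0.5)) ** 4
print(one_long_stay, four_short_stays)  # both are approximately 0.0488
```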
"},{"location":"community/meetings/2024-06-13/#pros-and-cons-of-model-structures","title":"Pros and cons of model structures","text":"Key message
Each framework has unique strengths and weaknesses.
As shown in the slide below, Cam identified a range of pros and cons for each modelling framework. Some of the key trade-offs between these frameworks are:
The ease of validation (aged care and hospital ward) versus the ease of communication (quarantine);
Having explicit physical parameters and units (hospital ward) versus having vague and/or phenomenological parameters (quarantine and aged care); and
Being simple to construct and efficient to run (quarantine and aged care) versus being complex to construct and computationally expensive (hospital ward).
Key message
Complex models typically have complex data requirements.
Data requirements can also present a challenge when constructing complex models. For example, behaviour models are good for highly-structured environments such as hospital wards, where nurses have scheduled tasks that are performed in specific spaces. However, the data required to construct the behaviour model can be very hard to collect, access, and use. Even if nurses wear sensors, the sensor data are never sufficiently clean or complete to use without substantial cleaning and processing.
Airflow between spaces in a highly-structured environment is also complex to model. Air exchange can involve diffusion between adjacent spaces, but also airflow between non-adjacent spaces through ventilation systems. These flows can be difficult to identify, and are computationally expensive to simulate (computational fluid dynamics).
Cam concluded by observing that existing hospital wards tend to have a design flaw for infection control:
There are many shared spaces in which infection can spread among individuals via air transmission.
"},{"location":"community/meetings/2024-06-13/#reproducibility-in-stochastic-models","title":"Reproducibility in stochastic models","text":"Key message
These models rely on random number generators (RNGs), whose outputs can be controlled by defining their initial seed. Using a separate RNG for each process in the model provides further advantages (see below).
In contrast to agent-based models of much larger populations, these models are small enough that they can be run as single-threaded code, and multiple simulations can be run in parallel. The bulk of computational cost is usually sweeping over many combinations of parameter values.
The aged care (multigraph) and hospital ward (shared space) models decouple the population RNG from the transmission dynamics RNG. An advantage of using multiple RNGs is that we can independently control and modify these processes. For example, by using separate RNGs for infections and testing, we can adjust testing practices without affecting the infection process.
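A minimal sketch of this idea using NumPy (illustrative only, not the actual model code) gives each stochastic process its own seeded generator, so drawing extra numbers for one process never shifts the stream used by another:

```python
# One random number generator per stochastic process, each with its own seed.
import numpy as np

seeds = {"population": 1, "transmission": 2, "testing": 3}
rngs = {name: np.random.default_rng(seed) for name, seed in seeds.items()}

# Drawing many extra values for "testing" (e.g. a different testing policy)
# does not change the values produced by the "transmission" generator.
transmission_draws = rngs["transmission"].random(5)
_ = rngs["testing"].random(50)

fresh = {name: np.random.default_rng(seed) for name, seed in seeds.items()}
assert np.allclose(transmission_draws, fresh["transmission"].random(5))
```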
"},{"location":"community/meetings/2024-06-13/#topic-for-a-masters-project","title":"Topic for a Masters project","text":"Question
Does anyone know a suitable Masters student?
Cam is looking for a Masters student to undertake a project that will look at individual-level counterfactual scenarios. The key idea is to identify sets of preconditions (e.g., salient details of the event history and/or current epidemic context) and ensure that the model will always generate the same outcome when given these preconditions. An open question is how far back in the event history is necessary/sufficient.
"},{"location":"community/meetings/2024-07-11/","title":"11 July 2094","text":""},{"location":"community/meetings/2024-07-11/#nefel-tellioglu-lessons-learned-from-pneumococcal-vaccine-modelling","title":"Nefel Tellioglu: Lessons learned from pneumococcal vaccine modelling","text":"In this meeting Nefel gave a presentation about a pneumococcal vaccine (PCV) evaluation project for government, sharing her experiences in developing a model from scratch under tight deadlines.
Attendance: 6 in person, 6 online.
Info
We welcome presentations about research projects and experiences that relate in any way to reproducibility and good computational research practices. Presentations can be short, informal, and free-form.
Please contact us if you have anything you might like to present!
"},{"location":"community/meetings/2024-07-11/#computational-performance","title":"Computational performance","text":"Key message
Optimisation is a skill that takes time to learn.
This project involved constructing an agent-based model (ABM) of pneumococcal disease, incorporating various vaccination assumptions and intervention strategies. Nefel was familiar with an existing ABM framework written in Python, but found that the project requirements (a large population size and long simulation time-frames) meant that a different approach was required.
Asking for help with a new skill: model optimisation for each vaccine type and multiple strains
They ended up implementing a model from scratch, using the Polars data frame library to represent each individual as a separate row in a single population data frame. This library is designed for high performance, and Nefel was able to implement a model that ran quickly enough for the purposes of this project.
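A small, hedged sketch of this kind of representation (illustrative columns only, not Nefel's model) looks like the following:

```python
# Represent each individual as one row of a Polars data frame, and update the
# whole population with vectorised expressions rather than Python loops.
import polars as pl

population = pl.DataFrame(
    {
        "age": [1, 4, 35, 70, 82],
        "vaccinated": [False, True, False, True, True],
        "infected": [False, False, True, False, False],
    }
)

# Example update: flag unvaccinated individuals aged 65 and over.
population = population.with_columns(
    ((pl.col("age") >= 65) & (~pl.col("vaccinated"))).alias("eligible")
)
print(population)
```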
An introduction to Polars workshop?
Nefel asked whether other people would be interested in an \"Introduction to Polars\" workshop, and a number of participants indicated interest.
"},{"location":"community/meetings/2024-07-11/#workflows-and-deadlines","title":"Workflows and deadlines","text":"Key message
Using version control makes it much easier to fix your code when it breaks.
Nefel made frequent use of a git repository (hosted on GitHub) in the early stages of the project. She found it very useful during the model prototyping phase, when adding new features frequently broke the code in some way. Having immediate access to previous versions of the code made it much easier to revert changes and fix the code.
However, she stopped using it when the project reached a series of tight deadlines.
"},{"location":"community/meetings/2024-07-11/#asking-for-extensions","title":"Asking for extensions","text":"Key message
Being able to provide advance warning of potential delays, and to explain the reasons why they might occur, is extremely helpful for everyone. This allows project leaders and stakeholders to adjust their plans and expectations.
It's generally hard to estimate feasible timelines in advance. This is especially difficult when exploring a new problem, and when a range of future extensions are anticipated.
These kinds of conversations can feel extremely uncomfortable. Several participants reflected on their own experiences, and agreed that informing their supervisors about potential problems as early as possible was the best approach.
Things can take longer than expected due to the research nature of building a new infectious disease model. Where possible, avoid promising that a model will be completed by a certain time. Instead, give stakeholders regular updates about progress and challenges, so that they can appreciate how much effort is being applied to the problem.
Gizem: stakeholders may not know what they want or need from the model. It's really helpful to clarify this early in the project, which needs a good working relationship.
Eamon: writing your code in a modular way can help make it easier to implement those future extensions. Experience also helps in designing your code so that future extensions only modify small parts of your model. But avoid trying to make your code as abstract and extensible as possible.
Rob: if you know that the model will be applied to many different scenarios in the future, try to separate the code that defines the location of data files from the code that uses those data. That makes it easier to run your model using different sets of input data.
"},{"location":"community/meetings/2024-07-11/#related-libraries-for-python-and-r","title":"Related libraries for Python and R","text":"Key message
There are a number of high-performance data frame libraries.
Polars primarily supports Python, Rust, and JavaScript. There is also an R package that has several extensions, including:
polarssql: a Polars backend for DBI and dbplyr; and
tidypolars: tidyverse syntax for Polars.
Other high-performance data frame options for R:
data.table: a high-performance data.frame replacement;
DBI: a package for working with various databases; and
dbplyr: a database backend for dplyr.
DuckDB is another high-performance library for working with databases and tabular data, and is available for many languages including R, Python, and Julia. It also integrates with Polars, allowing you to query Polars data frames and to save outputs as Polars data frames.
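As a brief, hedged illustration of that integration (assuming the duckdb and polars Python packages are installed):

```python
# Query a Polars data frame with DuckDB SQL and get a Polars data frame back.
import duckdb
import polars as pl

cases = pl.DataFrame({"region": ["A", "A", "B"], "n": [3, 5, 2]})

# DuckDB can scan Polars data frames in the local scope by name, and .pl()
# converts the query result back into a Polars data frame.
totals = duckdb.sql("SELECT region, SUM(n) AS total FROM cases GROUP BY region").pl()
print(totals)
```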
"},{"location":"community/meetings/2024-07-11/#conclusions","title":"Conclusions","text":"Key message
Once a project is completed, it's worth reflecting on what worked well, and on what you would do differently next time.
Nefel finished by reflecting on what she might do differently next time, and highlighting two key points:
Begin with a clearer understanding of the skills required for the project, such as modelling large populations and code optimisation.
Where there are potential skill gaps, involve other people in the project who can contribute relevant expertise.
At our next meeting \u2014 currently scheduled for Thursday 8 August \u2014 we will work on finalising our Orientation Guide checklist, collect supporting materials for each item on the checklist, and begin drafting content where no suitable supporting materials can be found.
"},{"location":"community/meetings/2024-08-08/","title":"8 August 2024","text":""},{"location":"community/meetings/2024-08-08/#orientation-guide","title":"Orientation guide","text":"Key message
The aim for the Orientation Guide is to provide a short overview of important concepts, useful tools and resources, and links to relevant tutorials and examples.
In this meeting we discussed how the Orientation Guide could best address the needs of new students and staff. We began by asking participants what skills, tools, and knowledge they've found to be particularly useful, and wish they'd discovered earlier.
Attendance: 5 in person, 2 online.
"},{"location":"community/meetings/2024-08-08/#core-tools-and-recommended-packages","title":"Core tools and recommended packages","text":"Key message
There was strong interest in having opinionated recommendations for helpful software packages and libraries, based on our own experiences.
When we start out, we typically don't know what tools are available and how to choose between them. So having guidance and recommendations from more experienced members of our community can be valuable.
This harks back to TK's presentation and their reflections on choosing the best tools for a specific project or task. For example:
Question
Which editor should a new student use for their project?
We strongly recommend choosing an editor that can automatically format and check your code.
Eamon suggested that in addition to linking to tutorials for installing common tools such as Python and R, the orientation guide should recommend helpful packages. For example:
For R, the tidyr package and the broader collection of \"tidyverse\" packages.
For Python, Conda is probably the easiest way to install Python and scientific/numeric Python libraries on Windows and OS X (it also supports Linux).
Jacob: it would be nice to have a flowchart or diagram to help identify relevant tools and packages. For example, if you want to (a) analyse tabular data; and (b) use Python; then what package would you recommend? (Our answer was Polars).
"},{"location":"community/meetings/2024-08-08/#reproducible-environments","title":"Reproducible environments","text":"Key message
Virtual environments allow you to install the packages that are required for a specific project, without affecting other projects.
This is useful for a number of reasons, including:
Being able to manage each project independently and in isolation;
Being able to use different versions of packages in different projects; and
Making it simpler to set up and run a project on a new computer.
Python provides built-in tools for virtual environments, and the University of Melbourne's Python Digital Skills Training includes a workshop on Python Virtual Environments.
For R, the renv package provides similar capabilities, and the Introduction to renv article provides a good overview.
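For the Python side, a minimal sketch using the standard library (more commonly this is done from a terminal with the equivalent python -m venv command) is:

```python
# Create a project-specific virtual environment; packages installed with this
# environment's pip are isolated from other projects.
import venv

venv.create(".venv", with_pip=True)
```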
Key message
Reproducible document formats such as RMarkdown (for R) and Jupyter Notebooks (for Python) provide a way to combine code, data, text, and figures in self-contained and reproducible plain-text files.
For introductions and examples, see:
Nick Tierney's RMarkdown for Scientists;
The Jupyter Notebook documentation; and
The Quarto publishing system is similar to RMarkdown, but allows you to write code in Python, R, and/or Julia, and supports many different output formats.
If you use VS Code to write Quarto documents, editing a code block will open it in Jupyter (for Python code), which allows you to step through and debug the code to some degree.
"},{"location":"community/meetings/2024-08-08/#existing-training-courses-and-needs","title":"Existing training courses and needs","text":"Key message
There will be an Introduction to Polars workshop at the SPECTRUM 2024 annual meeting (23-25 September), led by Nefel Tellioglu and Julian Carlin.
We asked participants if they had found any training materials that were particularly useful.
Mackrina said that she is using Python in her PhD project, but previously only had experience with Matlab.
Mackrina completed several online Python courses, but found that an in-person training session offered by the University of Melbourne's Digital Skills Training was the most useful. They regularly run a wide range of training sessions; see the list of upcoming sessions.
Mackrina found that the pypfilt package made it much easier to write ordinary differential equation (ODE) models and run scenario simulations. Note: Rob is the pypfilt author and maintainer. This package is designed for scenario modelling, and model-fitting using Sequential Monte Carlo (SMC) methods.
Other participants chimed in with recommended resources and training needs:
Rob: The Software Carpentry provides a good range of lessons.
TK: I want to learn how to use Docker and containers, and how to (install and) use greta.
Eamon: I'm happy to provide assistance and guidance with using Stan to fit models using Hamiltonian Monte Carlo.
Jiahao: Python ODE solvers are not described nearly as well as Matlab's ODE solvers, so they are harder to use.
Using GPGPUs for high-performance computing
Jiahao asked: Does anyone in our community have experience with using GPGPUs?
In response to Jiahao's question, Eamon replied that he has found it to be near-impossible, due to a combination of:
This initiated a broader discussion about improving the computational performance of our code and making use of high-performance computing (HPC) resources.
Computational performance was an issue that Nefel encountered when constructing an agent-based model of pneumococcal disease, and she found that code optimisation is a skill that takes time to learn.
We discussed several ways of using multiple CPU cores to make code run more quickly:
Using libraries that automatically make use of multiple CPU cores, such as future.apply for R, concurrent.futures for Python, and the Polars data-frame library.
Where we want to run large numbers of simulations, the easiest approach can often be for each simulation to only use one CPU core, and to run many simulations in parallel (e.g., on virtual machines that have many CPU cores). However, as TK pointed out, if each simulation uses a large amount of RAM, it may not be possible to run many simulations in parallel with this approach.
For larger scale problems, there are HPC platforms such as the University of Melbourne's Spartan and Monash University's MASSIVE. Using these platforms typically requires writing some bespoke code to define and schedule jobs.
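As a minimal sketch of the one-core-per-simulation approach described above (the simulation function is a stand-in for a real model run):

```python
# Run many independent, single-threaded simulations in parallel, using one
# CPU core (worker process) per simulation.
from concurrent.futures import ProcessPoolExecutor


def run_simulation(seed):
    # Stand-in for a real single-threaded model run.
    total = 0
    for i in range(100_000):
        total += (seed * i) % 7
    return seed, total


if __name__ == "__main__":
    with ProcessPoolExecutor() as executor:
        results = list(executor.map(run_simulation, range(20)))
    print(results[:3])
```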
Key message
There was strong interest in running a debugging workshop at the upcoming SPECTRUM 2024 annual meeting (23-25 September). As TK and Nefel have shown in their presentations, skills like debugging are extremely valuable for projects with tight deadlines, but these projects are also the worst time in which to develop and practice these skills.
Info
Attendees confirmed their willingness to evaluate and provide feedback on workshop draft materials.
Rob reflected that many people struggle to effectively debug their code, and can end up wasting a lot of time. Since we all make mistakes when writing code, this can be an extremely valuable skill. This is particularly true when working on, e.g., modelling contracts with government (see, e.g., the recent presentations from TK and Nefel).
We discussed some general guidelines, such as:
Structuring your code so that it's easier to debug (e.g., small functions);
Avoid hard-coding numerical values, file names, etc, as much as possible;
Making use of breakpoints rather than print() statements;
Checking that input arguments to a function satisfy necessary conditions;
Checking that output values from a function satisfy necessary conditions;
Failing early (e.g., raising exceptions in Python, calling stop() in R) rather than returning values that may not be valid.
Eamon: by learning how to debug code, I substantially improved how I write and modularise my code. My functions became smaller, and this helped me to make fewer mistakes.
For example, David Price and I encountered a bug in some C++ code where the function was correct, but made assumptions about the input arguments. These assumptions were initially satisfied, but as other parts of the code were updated, these assumptions were no longer true.
To address this, I often write if statements at the top of a function to check these kinds of conditions, and stop if there are failures. You can see examples of this in real-world code from popular packages.
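A hedged Python sketch of the same pattern (argument checks at the top of a function, failing early rather than returning a possibly invalid value) might look like:

```python
# Check input arguments up front and fail early with a clear error message,
# instead of returning a result that might silently be wrong.
def attack_rate(infections, population):
    if population <= 0:
        raise ValueError(f"population must be positive, got {population}")
    if not 0 <= infections <= population:
        raise ValueError(
            f"infections must be between 0 and {population}, got {infections}"
        )
    return infections / population


print(attack_rate(250, 1000))  # 0.25
# attack_rate(250, 0) would raise a ValueError immediately.
```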
James: I'm happy to provide an example of debugging within an R pipe. Learn Debugging might be a useful resource for our community.
Rob: failing early is good, rather than producing output and then having to discover that it's incorrect (which may not be obvious). Related skills include learning how to read a stack trace, and defensive programming (such as checking input arguments, as Eamon mentioned).
TK: it's really hard to change existing habits. And I'm not doing any coding in my projects right now. My most recent coding experiences were in COVID-19 projects (see TK's presentation) and the very tight deadlines didn't allow me the opportunity to develop and apply new skills.
Rob: everyone already debugs and tests their code to some degree, simply by writing and evaluating code line by line (e.g., in an interactive R or Python session) and by running functions with example arguments to check that they give sensible outputs. We just need to \"nudge\" these behaviours to make them more systematic and reproducible.
"},{"location":"community/training/","title":"Training events","text":"We will be running an Introduction to Debugging workshop at the SPECTRUM Annual Meeting 2024 (23-25 September).
"},{"location":"community/training/debugging/","title":"Introduction to Debugging","text":"This workshop was prepared for the SPECTRUM Annual Meeting 2024 (23-25 September).
Tip
We all make mistakes when writing code and introduce errors.
Having good debugging skills means that you can spend less time fixing your code.
See the discussion in our August 2024 meeting for further background.
"},{"location":"community/training/debugging/building-your-skills/","title":"Building your skills","text":"Tip
Whenever you debug some code, consider it as an opportunity to learn, reflect, and build your debugging skills.
Pay attention to your experience \u2014 what worked well, and what would you do differently next time?
"},{"location":"community/training/debugging/building-your-skills/#identifying-errors","title":"Identifying errors","text":"Write a failing test case, this allows you to verify that the bug can be reproduced.
"},{"location":"community/training/debugging/building-your-skills/#developing-a-plan","title":"Developing a plan","text":"What information might help you decide how to begin?
Can you identify a recent \"known good\" version of the code that doesn't include the error?
If you're using version control, have a look at your recent commits and check whether any of them are likely to have introduced or exposed this error.
"},{"location":"community/training/debugging/building-your-skills/#searching-for-the-root-cause","title":"Searching for the root cause","text":"We've shown how a debugger allows you to pause your code and see what it's actually doing. This is extremely helpful!
Tip
Other approaches may be useful, but avoid using trial-and-error.
To quickly confirm or rule out specific suspicions, you might consider using:
print()
statements;assert()
to verify whether specific conditions are met;git bisect
to identify the commit that introduced the error.Is there an optimal solution?
This might be the solution that changes as little code as possible, or it might be a solution that involves modifying and/or restructuring other parts of your code.
"},{"location":"community/training/debugging/building-your-skills/#after-its-fixed","title":"After it's fixed","text":"If you didn't write a test case to identify the error (see above), now is the time to write a test case to ensure you don't even make the same error again.
Are there other parts of your code where you might make a similar mistake, for which you could also write test cases?
Are there coding practices that might make this kind of error easier to find next time? For example, this might involve dividing your code into smaller functions, or using version control to record commits early and often.
Have you considered defensive programming practices? For example, at the start of a function it can often be a good idea to check that all of the arguments have valid values.
Are there tools or approaches that you haven't used before, and which might be worth trying next time?
"},{"location":"community/training/debugging/example-square-numbers/","title":"Example: Square numbers","text":"Square numbers are positive integers that are equal to the square of an integer. Here we have provided example Python and R scripts that print all of the square numbers between 1 and 100:
You can download these scripts to run on your own computer:
Each script contains three functions:
main()
find_squares(lower_bound, upper_bound)
is_square(value)
The diagram below shows how main() calls find_squares(), which in turn calls is_square() many times.
sequenceDiagram\n participant M as main()\n participant F as find_squares()\n participant I as is_square()\n activate M\n M ->>+ F: lower_bound = 1, upper_bound = 100\n Note over F: squares = [ ]\n F ->>+ I: value = 1\n I ->>- F: True/False\n F ->>+ I: value = 2\n I ->>- F: True/False\n F -->>+ I: ...\n I -->>- F: ...\n F ->>+ I: value = 100\n I ->>- F: True/False\n F ->>- M: squares = [...]\n Note over M: print(squares)\n deactivate M
Source code PythonR square_numbers.py#!/usr/bin/env python3\n\"\"\"\nPrint the square numbers between 1 and 100.\n\"\"\"\n\n\ndef main():\n squares = find_squares(1, 100)\n print(squares)\n\n\ndef find_squares(lower_bound, upper_bound):\n squares = []\n value = lower_bound\n while value <= upper_bound:\n if is_square(value):\n squares.append(value)\n value += 1\n return squares\n\n\ndef is_square(value):\n for i in range(1, value + 1):\n if i * i == value:\n return True\n return False\n\n\nif __name__ == '__main__':\n main()\n
square_numbers.R#!/usr/bin/env -S Rscript --vanilla\n#\n# Print the square numbers between 1 and 100.\n#\n\n\nmain <- function() {\n squares <- find_squares(1, 100)\n print(squares)\n}\n\n\nfind_squares <- function(lower_bound, upper_bound) {\n squares <- c()\n value <- lower_bound\n while (value <= upper_bound) {\n if (is_square(value)) {\n squares <- c(squares, value)\n }\n value <- value + 1\n }\n squares\n}\n\n\nis_square <- function(value) {\n for (i in seq(value)) {\n if (i * i == value) {\n return(TRUE)\n }\n }\n FALSE\n}\n\nif (! interactive()) {\n main()\n}\n
"},{"location":"community/training/debugging/example-square-numbers/#stepping-through-the-code","title":"Stepping through the code","text":"These recorded terminal sessions demonstrate how to use Python and R debuggers from the command line. They cover:
Interactive debugger sessions
If your editor supports running a debugger, use this feature! See these examples for RStudio, PyCharm, Spyder, and VS Code.
Python debuggerR debuggerVideo timeline:
is_square()
is_square()
squares
listVideo timeline:
is_square()
is_square()
squares
listPerfect numbers are positive integers that are equal to the sum of their divisors. Here we have provided example Python and R scripts that should print all of the perfect numbers up to 1,000.
You can download each script to debug on your own computer:
#!/usr/bin/env python3\n\"\"\"\nThis script prints perfect numbers.\n\nPerfect numbers are positive integers that are equal to the sum of their\ndivisors.\n\"\"\"\n\n\ndef main():\n start = 2\n end = 1_000\n for value in range(start, end + 1):\n if is_perfect(value):\n print(value)\n\n\ndef is_perfect(value):\n divisors = divisors_of(value)\n return sum(divisors) == value\n\n\ndef divisors_of(value):\n divisors = []\n candidate = 2\n while candidate < value:\n if value % candidate == 0:\n divisors.append(candidate)\n candidate += 1\n return divisors\n\n\nif __name__ == '__main__':\n main()\n
perfect_numbers.R#!/usr/bin/env -S Rscript --vanilla\n#\n# This script prints perfect numbers.\n#\n# Perfect numbers are positive integers that are equal to the sum of their\n# divisors.\n#\n\n\nmain <- function() {\n start <- 2\n end <- 1000\n for (value in seq.int(start, end)) {\n if (is_perfect(value)) {\n print(value)\n }\n }\n}\n\n\nis_perfect <- function(value) {\n divisors <- divisors_of(value)\n sum(divisors) == value\n}\n\n\ndivisors_of <- function(value) {\n divisors <- c()\n candidate <- 2\n while (candidate < value) {\n if (value %% candidate == 0) {\n divisors <- c(divisors, candidate)\n }\n candidate <- candidate + 1\n }\n divisors\n}\n\n\nmain()\n
But there's a problem ...
If we run these scripts, we see that they don't print anything:
How should we begin investigating?
Interactive debugger sessions
If your editor supports running a debugger, use this feature! See these examples for RStudio, PyCharm, Spyder, and VS Code.
Some initial thoughts ...Are we actually running the main()
function at all?
The main()
function is almost certainly not the cause of this error.
The is_perfect()
function is very simple, so it's unlikely to be the cause of this error.
The divisors_of()
function doesn't look obviously wrong.
But there must be a mistake somewhere!
Let's use a debugger to investigate.
Here we have provided SIR ODE model implementations in Python and in R. Each script runs several scenarios and produces a plot of infection prevalence for each scenario.
You can download each script to debug on your computer:
#!/usr/bin/env python3\n\nimport matplotlib.pyplot as plt\nimport numpy as np\nfrom scipy.integrate import solve_ivp\n\n\ndef sir_rhs(time, state, popn, beta, gamma):\n \"\"\"\n The right-hand side for the vanilla SIR compartmental model.\n \"\"\"\n s_to_i = beta * state[1] * state[0] / popn # beta * I(t) * S(t) / N\n i_to_r = gamma * state[1] # gamma * I(t)\n return [-s_to_i, s_to_i - i_to_r, i_to_r]\n\n\ndef run_model(settings):\n \"\"\"\n Return the SIR ODE solution for the given model settings.\n \"\"\"\n # Define the time span and evaluation times.\n sim_days = settings['sim_days']\n time_span = [0, sim_days]\n times = np.linspace(0, sim_days, num=sim_days + 1)\n # Define the initial state.\n popn = settings['population']\n exposures = settings['exposures']\n initial_state = [popn - exposures, exposures, 0]\n # Define the SIR parameters.\n params = (popn, settings['beta'], settings['gamma'])\n # Return the daily number of people in S, I, and R.\n return solve_ivp(\n sir_rhs, time_span, initial_state, t_eval=times, args=params\n )\n\n\ndef default_settings():\n \"\"\"\n The default model settings.\n \"\"\"\n return {\n 'sim_days': 20,\n 'population': 100,\n 'exposures': 2,\n 'beta': 1.0,\n 'gamma': 0.5,\n }\n\n\ndef run_model_scaled_beta(settings, scale):\n \"\"\"\n Adjust the value of ``beta`` before running the model.\n \"\"\"\n settings['beta'] = scale * settings['beta']\n return run_model(settings)\n\n\ndef run_model_scaled_gamma(settings, scale):\n \"\"\"\n Adjust the value of ``gamma`` before running the model.\n \"\"\"\n settings['gamma'] = scale * settings['gamma']\n return run_model(settings)\n\n\ndef plot_prevalence_time_series(solutions):\n \"\"\"\n Plot daily prevalence of infectious individuals for one or more scenarios.\n \"\"\"\n fig, axes = plt.subplots(\n constrained_layout=True,\n nrows=len(solutions),\n sharex=True,\n sharey=True,\n )\n for ix, (scenario_name, solution) in enumerate(solutions.items()):\n ax = axes[ix]\n ax.title.set_text(scenario_name)\n ax.plot(solution.y[1], label='I')\n ax.set_xticks([0, 5, 10, 15, 20])\n # Save the figure.\n png_file = 'sir_ode_python.png'\n fig.savefig(png_file, format='png', metadata={'Software': None})\n\n\ndef demonstration():\n settings = default_settings()\n default_scenario = run_model(settings)\n scaled_beta_scenario = run_model_scaled_beta(settings, scale=1.5)\n scaled_gamma_scenario = run_model_scaled_gamma(settings, scale=0.7)\n\n plot_prevalence_time_series(\n {\n 'Default': default_scenario,\n 'Scaled Beta': scaled_beta_scenario,\n 'Scaled Gamma': scaled_gamma_scenario,\n }\n )\n\n\nif __name__ == '__main__':\n demonstration()\n
sir_ode.R#!/usr/bin/env -S Rscript --vanilla\n\nlibrary(deSolve)\nsuppressPackageStartupMessages(library(dplyr))\nsuppressPackageStartupMessages(library(ggplot2))\n\n\n# The right-hand side for the vanilla SIR compartmental model.\nsir_rhs <- function(time, state, params) {\n s_to_i <- params$beta * state[\"I\"] * state[\"S\"] / params$popn\n i_to_r <- params$gamma * state[\"I\"]\n list(c(-s_to_i, s_to_i - i_to_r, i_to_r))\n}\n\n\n# Return the SIR ODE solution for the given model settings.\nrun_model <- function(settings) {\n # Define the time span and evaluation times.\n times <- seq(0, settings$sim_days)\n # Define the initial state.\n popn <- settings$population\n exposures <- settings$exposures\n initial_state <- c(S = popn - exposures, I = exposures, R = 0)\n # Define the SIR parameters.\n params <- list(\n popn = settings$population,\n beta = settings$beta,\n gamma = settings$gamma\n )\n # Return the daily number of people in S, I, and R.\n ode(initial_state, times, sir_rhs, params)\n}\n\n\n# The default model settings.\ndefault_settings <- function() {\n list(\n sim_days = 20,\n population = 100,\n exposures = 2,\n beta = 1.0,\n gamma = 0.5\n )\n}\n\n\n# Adjust the value of ``beta`` before running the model.\nrun_model_scaled_beta <- function(settings, scale) {\n settings$beta <- scale * settings$beta\n run_model(settings)\n}\n\n\n# Adjust the value of ``gamma`` before running the model.\nrun_model_scaled_gamma <- function(settings, scale) {\n settings$gamma <- scale * settings$gamma\n run_model(settings)\n}\n\n\n# Plot daily prevalence of infectious individuals for one or more scenarios.\nplot_prevalence_time_series <- function(solutions) {\n df <- lapply(\n names(solutions),\n function(name) {\n solutions[[name]] |>\n as.data.frame() |>\n mutate(scenario = name)\n }\n ) |>\n bind_rows() |>\n mutate(\n scenario = factor(scenario, levels = names(solutions), ordered = TRUE)\n )\n fig <- ggplot() +\n geom_line(aes(time, I), df) +\n xlab(NULL) +\n scale_y_continuous(\n name = NULL,\n limits = c(0, 40),\n breaks = c(0, 20, 40)\n ) +\n facet_wrap(~ scenario, ncol = 1) +\n theme_bw(base_size = 10) +\n theme(\n strip.background = element_blank(),\n panel.grid = element_blank(),\n )\n png_file <- \"sir_ode_r.png\"\n ggsave(png_file, fig, width = 640, height = 480, units = \"px\", dpi = 150)\n}\n\n\ndemonstration <- function() {\n settings <- default_settings()\n default_scenario <- run_model(settings)\n scaled_beta_scenario <- run_model_scaled_beta(settings, scale=1.5)\n scaled_gamma_scenario <- run_model_scaled_gamma(settings, scale=0.7)\n\n plot_prevalence_time_series(\n list(\n Default = default_scenario,\n `Scaled Beta` = scaled_beta_scenario,\n `Scaled Gamma` = scaled_gamma_scenario\n )\n )\n}\n\ndemonstration()\n
The model outputs differ!
Here are prevalence time-series plots produced by each script:
Python plotR plotModel outputs for the Python script.
Model outputs for the R script.
Interactive debugger sessions
If your editor supports running a debugger, use this feature! See these examples for RStudio, PyCharm, Spyder, and VS Code.
Some initial thoughts ...Is it obvious whether one of the figures is correct and the other is wrong?
The sir_rhs()
functions in the two scripts appear to be equivalent \u2014 but are they?
The default_settings()
functions appear to be equivalent \u2014 but are they?
The run_model_scaled_beta()
and run_model_scaled_gamma()
functions also appear to be equivalent.
Where might you begin looking?
In this workshop, we will introduce the concept of \"debugging\", and demonstrate techniques and tools that can help us efficiently identify and remove errors from our code.
After completing this workshop, participants will:
Understand that debugging can be divided into a sequence of actions;
Understand the purpose of each of these actions;
Be familiar with techniques and tools that can help perform these actions;
Be able to apply these techniques and tools to their own code.
Info
By achieving these learning objectives, participants should be able to find and correct errors in their code more quickly and with greater confidence.
"},{"location":"community/training/debugging/manifesto/","title":"Debugging manifesto","text":"Julia Evans and Tanya Brassie: Debugging Manifesto Poster, 2024.Info
See the Resources page for links to more of Julia Evans' articles, stories, and zines about debugging.
"},{"location":"community/training/debugging/resources/","title":"Resources","text":"Info
Please don't look as these solutions until you have attempted the exercises.
Perfect numbersPerfect numbers are equal to the sum of their proper divisors \u2014 all divisors except the number itself.
For example, 6 is a perfect number. Its proper divisors are 1, 2, and 3, and 1 + 2 + 3 = 6.
The mistake here is that the divisors_of()
function only returns divisors greater than one, and so the code fails to identify any of the true perfect numbers.
Interestingly, this mistake did not result in the code mistakenly identifying any other numbers as perfect numbers.
Python vs RIf you're only familiar with one of these two languages, you may be surprised to discover that they have some fundamental differences. In this exercise we demonstrated one consequence of the ways that these languages handle function arguments.
The R Language Definition
The semantics of invoking a function in R argument are call-by-value. In general, supplied arguments behave as if they are local variables initialized with the value supplied and the name of the corresponding formal argument. Changing the value of a supplied argument within a function will not affect the value of the variable in the calling frame.
\u2014 Argument Evaluation
Python Programming FAQ
Remember that arguments are passed by assignment in Python. Since assignment just creates references to objects, there's no alias between an argument name in the caller and callee, and so no call-by-reference per se.
\u2014 How do I write a function with output parameters (call by reference)?
In R the run_model_scaled_beta()
and run_model_scaled_gamma()
functions do not modify the value of settings
in the demonstration()
function. This produces model outputs for the following parameter combinations:
In Python the run_model_scaled_beta()
and run_model_scaled_gamma()
functions do modify the value of settings
in the demonstration()
function. This produces model outputs for the following parameter combinations:
Answer
The value of \u03b2 is different in the third combination.
"},{"location":"community/training/debugging/understanding-error-messages/","title":"Understanding error messages","text":"Tip
The visible error and its root cause may be located in very different parts of your code.
If there's an error in your code that causes the program to terminate, read the error message and see what it can tell you.
Most of the time, the error message should allow to identify:
What went wrong? For example, did it try to read data from a file that does not exist?
Where did this happen? On which line of which file did the error occur?
When an error occurs, one useful piece of information is knowing which functions were called in order to make the error occur.
Below we have example Python and R scripts that produce an error.
Question
Can you identify where the error occurred, just by looking at the error message?
OverviewPythonRYou can download each script and run them on your own computer:
Traceback (most recent call last):\n File \"stacktrace.py\", line 46, in <module>\n status = main()\n File \"stacktrace.py\", line 7, in main\n do_big_tasks()\n File \"stacktrace.py\", line 17, in do_big_tasks\n do_third_step(i, quiet=quiet)\n File \"stacktrace.py\", line 38, in do_third_step\n try_something()\n File \"stacktrace.py\", line 42, in try_something\n raise ValueError(\"Whoops, this failed\")\nValueError: Whoops, this failed\n
Source code stacktrace.py#!/usr/bin/env python3\n\nimport sys\n\n\ndef main():\n do_big_tasks()\n return 0\n\n\ndef do_big_tasks(num_tasks=20, quiet=True):\n for i in range(num_tasks):\n prepare_things(i, quiet=quiet)\n do_first_step(i, quiet=quiet)\n do_second_step(i, quiet=quiet)\n if i > 15:\n do_third_step(i, quiet=quiet)\n\n\ndef prepare_things(task_num, quiet=True):\n if not quiet:\n print(f'Preparing for task #{task_num}')\n\n\ndef do_first_step(task_num, quiet=True):\n if not quiet:\n print(f'Task #{task_num}: doing step #1')\n\n\ndef do_second_step(task_num, quiet=True):\n if not quiet:\n print(f'Task #{task_num}: doing step #2')\n\n\ndef do_third_step(task_num, quiet=True):\n if not quiet:\n print(f'Task #{task_num}: doing step #3')\n try_something()\n\n\ndef try_something():\n raise ValueError(\"Whoops, this failed\")\n\n\nif __name__ == \"__main__\":\n status = main()\n sys.exit(status)\n
"},{"location":"community/training/debugging/understanding-error-messages/#the-error-message_1","title":"The error message","text":"Error in try_something() : Whoops, this failed\nCalls: main -> do_big_tasks -> do_third_step -> try_something\nBacktrace:\n \u2586\n 1. \u2514\u2500global main()\n 2. \u2514\u2500global do_big_tasks()\n 3. \u2514\u2500global do_third_step(i, quiet = quiet)\n 4. \u2514\u2500global try_something()\nExecution halted\n
Source code stacktrace.R#!/usr/bin/env -S Rscript --vanilla\n\noptions(error = rlang::entrace)\n\n\nmain <- function() {\n do_big_tasks()\n invisible(0)\n}\n\ndo_big_tasks <- function(num_tasks = 20, quiet = TRUE) {\n for (i in seq_len(num_tasks)) {\n prepare_things(i, quiet = quiet)\n do_first_step(i, quiet = quiet)\n do_second_step(i, quiet = quiet)\n if (i > 15) {\n do_third_step(i, quiet = quiet)\n }\n }\n}\n\nprepare_things <- function(task_num, quiet = TRUE) {\n if (!quiet) {\n cat(\"Preparing for task #\", task_num, \"\\n\", sep = \"\")\n }\n}\n\ndo_first_step <- function(task_num, quiet = TRUE) {\n if (!quiet) {\n cat(\"Task #\", task_num, \": doing step #1\\n\", sep = \"\")\n }\n}\n\ndo_second_step <- function(task_num, quiet = TRUE) {\n if (!quiet) {\n cat(\"Task #\", task_num, \": doing step #2\\n\", sep = \"\")\n }\n}\n\ndo_third_step <- function(task_num, quiet = TRUE) {\n if (!quiet) {\n cat(\"Task #\", task_num, \": doing step #3\\n\", sep = \"\")\n }\n try_something()\n}\n\ntry_something <- function() {\n stop(\"Whoops, this failed\")\n}\n\nif (! interactive()) {\n status <- main()\n quit(status = status)\n}\n
"},{"location":"community/training/debugging/using-a-debugger/","title":"Using a debugger","text":"The main features of a debugger are:
Breakpoints: pause the program when a particular line of code is about to be executed;
Display/print: show the current value of local variables;
Next: execute the current line of code and pause at the next line;
Continue: continue executing code until the next breakpoint, or the code finishes.
Slightly more advanced features include:
Conditional breakpoints: pause the program when a particular line of code is about to be executed and a specific condition is satisfied.
Step: execute the current line of code and pause at the first possible point \u2014 either the line in the current function or the first line in a function that is called.
For example, consider the following code example:
def first_function():\n    total = 0\n    for x in range(1, 50):\n        y = second_function(x)\n        total = total + y\n\n    return total\n\n\ndef second_function(a):\n    result = 3 * a**2 + 5 * a\n    return result\n\n\nfirst_function()\n
first_function <- function() {\n total <- 0\n for (x in seq(49)) {\n y <- second_function(x)\n total <- total + y\n }\n total\n}\n\nsecond_function <- function(a) {\n result <- 3 * a^2 + 5 * a\n result\n}\n\nfirst_function()\n
We can use a conditional breakpoint to pause on line 4 (highlighted) only when x = 42
.
We can then use step to begin executing line 4 and pause on line 11, where we will see that a = 42
.
If we instead used next at line 4 (highlighted), the debugger would execute line 4 and then pause on line 5.
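As a rough sketch, this is how such a session might look with Python's built-in pdb debugger (the script name example.py is hypothetical):
python3 -m pdb example.py\n(Pdb) break 4, x == 42\n(Pdb) continue\n(Pdb) p x\n42\n(Pdb) step\n(Pdb) p a\n42\n
Here break 4, x == 42 sets a conditional breakpoint on line 4, continue runs until that condition is satisfied, and step moves into second_function().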
Info
Debugging is the process of identifying and removing errors from computer software.
You need to identify (and reproduce) the problem, and only then begin fixing it. Ideally, write a test case first, so that you can (a) confirm that you have identified the problem; and (b) detect if you accidentally introduce the same, or a similar, mistake in the future.
"},{"location":"community/training/debugging/what-is-debugging/#action-1-identify-the-error","title":"Action 1: Identify the error","text":"Tip
First make sure that you can reproduce the error.
What observations or outputs indicate the presence of this error?
Is the error reproducible, or does it come and go?
Can you write a failing test that captures this error?
"},{"location":"community/training/debugging/what-is-debugging/#action-2-develop-a-plan","title":"Action 2: Develop a plan","text":"Tip
The visible error and its root cause may be located in very different parts of your code.
Identify likely and unlikely suspects: what can we rule in or out? What parts of your code have changed recently? When was the last time you might have noticed this error?
"},{"location":"community/training/debugging/what-is-debugging/#action-3-search-for-the-root-cause","title":"Action 3: Search for the root cause","text":"Tip
As much as possible, the search should be guided by facts, not assumptions.
Our assumptions about the code can help us to develop a plan, but we need to verify whether our assumptions are actually true.
For example:
Simple errors can often hide\nhide in plain sight and be\nsurprisingly difficult to\ndiscover without assistance.\n
Thinking \"this looks right\" is not a reliable indicator of whether a piece of code contains an error.
Searching at random is like looking for a needle in a haystack. (Perry McKenna, Flickr, 2009; CC BY 2.0) Better approaches involve confirming what the code is actually doing.
This can be done using indirect approaches, such as adding print statements or writing test cases.
It can also be done by directly inspecting the state of the program with a debugger \u2014 more on that shortly!
Tip
It's worth considering whether the root cause is the result of a deliberate decision or an unintentional mistake.
Don't start modifying, adding, or removing lines based on suspicion, or on the off chance that it might work. Until you have identified the root cause of the error, there is no guarantee that making the symptom disappear has actually fixed the underlying problem.
"},{"location":"community/training/debugging/what-is-debugging/#action-5-after-its-fixed","title":"Action 5: After it's fixed","text":"Tip
This is the perfect time to reflect on your experience!
What can you learn from this experience? Can you avoid this mistake in the future? What parts of the process were the hardest or took the longest? Are there tools and techniques that might help you next time?
"},{"location":"community/training/debugging/why-are-debuggers-useful/","title":"Why are debuggers useful?","text":"Tip
A debugger is a tool for examining the state of a running program.
Debuggers are useful because they show us what the code is actually doing.
Many of the errors that take a long time for us to find are relatively simple once we find them.
We usually have a hard time finding these errors because:
We read what we expect to see, rather than what is actually written; and
We rely on assumptions about where the mistake might be, and our intuition is often wrong.
Here are some common mistakes that can be difficult to identify when reading through your own code:
Using an incorrect index into an array, matrix, list, etc;
Using incorrect bounds on a loop or sequence;
Confusing the digit \"1\" with the letter \"l\";
Confusing the digit \"0\" with the letter \"O\".
These materials are divided into the following sections:
Understanding version control, which provides you with a complete and annotated history of your work, and with powerful ways to search and examine this history.
Learning to use Git, the most widely used version control system, which is the foundation of popular code-hosting services such as GitHub, GitLab, and Bitbucket.
Using Git to collaborate with colleagues in a precisely controlled and manageable way.
Structuring your project so that it is easier for yourself and others to navigate.
Writing your code so that it clearly expresses your intent and ideas.
Ensuring that your research is reproducible by others.
Using testing frameworks to verify that your code behaves as intended, and to automatically detect when you introduce a bug or mistake into your code.
Running your code on various computing platforms that allow you to obtain results efficiently and without relying on your own laptop/computer.
This page defines the learning objectives for individual sections. These are skills that the reader should be able to demonstrate after reading through the relevant section, and completing any exercises in that section.
"},{"location":"guides/learning-objectives/#version-control-concepts","title":"Version control concepts","text":"After completing this section, you should be able to identify how to apply version control concepts to your existing work. This includes being able to:
Identify projects and tasks for which version control would be suitable;
Categorise recent work activities into one or more commits;
Write commit messages that describe what changes you made and why you made them; and
Identify pieces of work that could be carried out in separate branches of a repository.
After completing this section, you should be able to:
Create a local repository;
Create commits in your local repository;
Search your commit history to identify commits that made a specific change;
Create a remote repository;
Push commits from your local repository to a remote repository;
Pull commits from a remote repository to your local repository;
Use tags to identify important milestones;
Work in a separate branch and then merge your changes into your main branch; and
Resolve merge conflicts.
After completing this section, you should be able to:
Share a repository with one or more collaborators;
Create a pull request;
Use a pull request to review a collaborator's work;
Use a pull request to merge a collaborator's work into your main branch; and
Conduct peer code review in a respectful manner.
After completing this section, you should be able to:
Understand how to structure a new project;
Understand how to separate \"what to do\" from \"how to do it\"; and
Structure your code to enable new experiments and analyses.
After completing this section, you should be able to:
Divide your code into functions and modules;
Ensure that your code is a clear expression of your ideas;
Structure your code into reusable packages; and
Take advantage of code formatters and code linters.
These materials assume that the reader has a basic knowledge of the Bash command-line shell and using SSH to connect to remote computers. You should be comfortable with using the command-line to perform the following tasks:
Please refer to the following materials for further details:
Info
If you use Windows, you may want to use PowerShell instead of Bash, in which case please refer to this Introduction to the Windows Command Line with Powershell.
Some chapters also assume that the reader has an account on GitHub and has added an SSH key to their account.
"},{"location":"guides/resources/","title":"Useful resources","text":""},{"location":"guides/resources/#education-and-commentary-articles","title":"Education and commentary articles","text":"A Beginner's Guide to Conducting Reproducible Research describes key requirements for producing reproducible research outputs.
Why code rusts collects together some of the reasons why the behaviour of code changes over time.
Point of View: How open science helps researchers succeed presents evidence that open research practices bring significant benefits to researchers.
The Journal of Statistics and Data Science Education published a special issue: Teaching Reproducibility in November 2022. Also see the presentations from an invited paper session:
Collaborative writing workflows: building blocks towards reproducibility
Opinionated practices for teaching reproducibility: motivation, guided instruction, and practice
From teaching to practice: Insights from the Toronto Reproducibility Conferences
Teaching reproducibility and responsible workflow: an editor's perspective
A Quick Guide to Organizing Computational Biology Projects suggests an approach for structuring a computational research repository.
The TIER Protocol 4.0 provides a template for organising the contents and reproduction documentation for projects that involve working with statistical data:
Documentation that meets the specifications of the TIER Protocol contains all the data, scripts, and supporting information necessary to enable you, your instructor, or an interested third party to reproduce all the computations necessary to generate the results you present in the report you write about your project.
A simple kit to use computational notebooks for more openness, reproducibility, and productivity in research provides some good recommendations for organising a project repository and setting up a reproducible workflow using computational notebooks.
NDP Software have created an interactive Git cheat-sheet that shows how git commands interact with the local and upstream repositories, and provides brief documentation for many common examples.
The Pro Git book is available online. It starts with an overview of Git basics and then covers every aspect of Git in great detail.
The Software Carpentry Foundation publishes many lessons, including Version Control with Git.
A Quick Introduction to Version Control with Git and GitHub provides a short guide to using Git and GitHub. It presents an example of analysing publicly available ChIP-seq data with Python. The repository for the article is also publicly available.
CoMo Consortium App: the COVID-19 International Modelling Consortium (CoMo) has developed a Shiny web application for an age-structured, compartmental SEIRS model.
Mastering Shiny: an online book that teaches how to create web applications with R and Shiny.
The Art of Giving and Receiving Code Reviews (Gracefully)
Code Review in the Lab
Scientific Code Review
The 5 Golden Rules of Code Review
GitHub Actions for Python: the GitHub Actions documentation includes examples of building and testing Python projects.
Building reproducible analytical pipelines with R: this article shows how to use GitHub Actions to run R code when you push new commits to a GitHub repository.
GitHub Actions for the R language: this repository provides a variety of GitHub actions for R projects, such as installing specific versions of R and R packages.
See the GitHub actions for Git is my lab book. The build action does the following:
Check out the repository with actions/checkout
;
Install Python with actions/setup-python
;
Install Material for MkDocs and other required tools, as listed in requirements.txt
; and
Build the HTML version of this book with mkdocs
.
How to access the ARDC Nectar Research Cloud
Melbourne Research Cloud
High Performance Computing at University of Melbourne
The ARDC Guide to making software citable explains how to cite your code and assign it a DOI.
Recognizing the value of software: a software citation guide provides further examples and guidance for ensuring your work receives proper attribution and credit.
Choose an open source license provides advice for selecting an appropriate license that meets your needs.
A Quick Guide to Software Licensing for the Scientist-Programmer explains the various types of available licenses and provides advice for selecting a suitable license.
This section demonstrates how to use Git for collaborative research, enabling multiple people to work on the same code or paper in parallel. This includes deciding how to structure your repository, how to use branches for each collaborator, and how to use tags to track your progress.
Info
We also show how these skills support peer code review, so that you can share knowledge with, and learn from, your colleagues as part of your regular activity.
"},{"location":"guides/collaborating/an-example-pull-request/","title":"An example pull request","text":"The initial draft of each chapter in this section were proposed in a pull request.
When this pull request was created, the branch added four new commits:
85594bf Add some guidelines for collaboration workflows\n678499b Discuss coding style guides\n2b9ff70 Discuss merge/pull requests and peer code review\n6cc6f54 Discuss repository structure and licenses\n
and the author (Rob Moss) asked the reviewer (Eamon Conway) to address several details in particular.
Eamon made several suggestions in their initial response, including:
Moving the How to structure a repository and Choosing a license chapters to the Effective use of git section;
Starting this section with the Collaborating on code chapter; and
Agreeing that we should use this pull request as an example in this book.
In response, Rob pushed two commits that addressed the first two points above:
e1d1dd9 Move collaboration guidelines to the start\n3f78ef8 Move the repository structure and license chapters\n
and then wrote this chapter to show how we used a pull request to draft this book section.
"},{"location":"guides/collaborating/coding-style-guides/","title":"Coding style guides","text":"A style guide defines rules and guidelines for how to structure and format your code. This can make code easier to write, because you don't need to worry about how to format your code. It can also make code easier to read, because consistent styling allows you to focus on the content.
There are two types of tools that can help you use a style guide:
A formatter formats your code to make it consistent with a specific style; and
A linter checks whether your code is consistent with a specific style.
Because programming languages can be very different from each other, style guides are usually defined for a single programming language.
Here we list some of the most widely-used style guides for several common programming languages:
For R there is the tidyverse style guide; you can check that your code conforms to this style with lintr.
For Python there is Black, which defines a coding style and applies this style to your code.
For C++ there is a Google C++ style guide.
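As a rough sketch, these tools are typically run from the command line; the file names below are hypothetical:
black analysis.py\nRscript -e 'lintr::lint(\"analysis.R\")'\n
Black rewrites the file in place, while lintr only reports style issues without modifying your code.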
Once you are comfortable with creating commits, working in branches, and merging branches, you can use these skills to write papers collaboratively as a team. This approach is particularly useful if you are writing a paper in LaTeX.
Here are some general guidelines that you may find useful:
Divide the paper into separate LaTeX files for each section.
Use tags to identify milestones such as draft versions and revisions.
Consider creating a separate branch for each collaborator.
Use latexdiff to show tracked changes between the current version and a previous commit/tag:
latexdiff-git --flatten -r tag-name paper.tex\n
Collaborators who will provide feedback, rather than contributing directly to the writing process, can do this by:
Annotating PDF versions of the paper; or
Once you are comfortable with creating commits, working in branches, and merging branches, you can use these skills to write code collaboratively as a team.
The precise workflow will depend on the nature of your research and on the collaborators in your team, but there are some general guidelines that you may find helpful:
Agree on a style guide.
Work on separate features in separate branches.
Use peer code review before merging changes from these branches.
Consider using continuous integration to:
Run test cases and detect bugs as early as possible; and
Continuous Integration (CI) is an automated process where code changes are merged in a central repository in order to run automated tests and other processes. This can provide rapid feedback while you develop your code and collaborate with others, as long as commits are regularly pushed to the central repository.
Info
This book is an example of Continuous Integration: every time a commit is pushed to the central repository, the online book is automatically updated.
Because the central repository is hosted on GitHub, we use GitHub Actions. Note that this is a GitHub-specific CI system. You can view the update action for this book here.
We also use CI to publish each pull request, so that contributions can be previewed during the review process. We added this feature in this pull request.
"},{"location":"guides/collaborating/merge-pull-requests/","title":"Merge/Pull requests","text":"Recall that incorporating the changes from one branch into another branch is referred to as a \"merge\". You can merge one branch into another branch by taking the following steps:
Checking out the branch you want to merge the changes into:
git checkout -b my-branch\n
Merging the changes from the other branch:
git merge other-branch\n
Tip
It's a good idea to review these changes before you merge them.
If possible, it's even better to have someone else review the changes.
You can use git diff
to view differences between branches. However, platforms such as GitHub and GitLab offer an easier approach: \"pull requests\" (also called \"merge requests\").
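For example, to inspect the differences between the two branches locally before merging, you could run:
git diff my-branch other-branch\n
This compares the tips of the two branches and shows every change between them.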
The steps required to create a pull request differ depending on which platform you are using. Here, we will describe how to create a pull request on GitHub. For further details, see the GitHub documentation.
Open the main page of your GitHub repository.
In the \"Branch\" menu, select the branch that contains the changes you want to merge.
Open the \"Contribute\" menu. This should be located on the right-hand side, above the list of files.
Click the \"Open pull request\" button.
In the \"base\" menu, select the branch you want to merge the changes into.
Enter a descriptive title for the pull request.
In the message editor, write a summary of the changes in this branch, and identify specific questions or objectives that you want the reviewer to address.
Select potential reviewers by clicking on the \"Reviewers\" link in the right-hand sidebar.
Click the \"Create pull request\" button.
Once the pull request has been created, the reviewer(s) can review your changes and discuss their feedback and suggestions with you.
"},{"location":"guides/collaborating/merge-pull-requests/#merging-a-pull-request-on-github","title":"Merging a pull request on GitHub","text":"When the pull request has been reviewed to your satisfaction, you can merge these changes by clicking the \"Merge pull request\" button.
Info
If the pull request has merge conflicts (e.g., if the branch you're merging into contains new commits), you will need to resolve these conflicts.
For further details about merging pull requests on GitHub, see the GitHub documentation.
"},{"location":"guides/collaborating/peer-code-review/","title":"Peer code review","text":"Once you're comfortable in using merge/pull requests to review changes in a branch, you can use this approach for peer code review.
Info
Remember that code review is a discussion and critique of a person's work. The code author will naturally feel that they own the code, and the reviewer needs to respect this.
For further advice and suggestions on how to conduct peer code review, please see the Performing peer code review resources.
Tip
Mention people who have reviewed your code in the acknowledgements section of the paper.
"},{"location":"guides/collaborating/peer-code-review/#define-the-goals-of-a-peer-review","title":"Define the goals of a peer review","text":"In creating a pull request and inviting someone to review your work, the pull request description should include the following details:
An overview of the work included in the pull request: what have you done, why have you done it?
You may also want to explain how this work fits into the broader context of your research project.
Identify specific questions or tasks that you would like the reviewer to address. For example, you might ask the reviewer to address one or more of the following questions:
Can the reviewer run your code and reproduce the outputs?
Is the code easy to understand?
If you have a style guide, is the code formatted appropriately?
Do the model equation or data analysis steps seem sensible?
If you have written documentation, is it easy to understand?
Can the reviewer suggest how to improve or rewrite a specific piece of code?
Tip
Make the reviewer's job easier by giving them small amounts of code to review.
"},{"location":"guides/collaborating/peer-code-review/#finding-a-reviewer","title":"Finding a reviewer","text":"On GitHub we have started a peer-review team. We encourage you to post on the discussion board, to find like-minded members to review your code.
"},{"location":"guides/collaborating/peer-code-review/#guidelines-for-reviewing-other-peoples-code","title":"Guidelines for reviewing other people's code","text":"Peer code review is an opportunity for the author and the reviewer to learn from each other and improve a piece of code.
Tip
The most important guideline for the reviewer is to be kind.
Treat other people's code the way you would want them to treat your code.
Avoid saying \"you\". Instead, say \"we\" or make the code the subject of the sentence.
Don't say \"You don't have a test for this function\", but instead say \"We should test this function\".
Don't say \"Why did you write it this way?\", but instead say \"What are the advantages of this approach?\".
Ask questions rather than stating criticisms.
Don't say \"This code is unclear\", but instead say \"Can you help me understand how this code works?\".
Treat peer review as an opportunity to praise good work!
Don't be afraid to tell the author that a piece of code was very clear, easy to understand, or well written.
Tell the author if reading their code made you aware of a useful function or package.
Tell the author if reading their code gave you some ideas for your own code.
Once the peer code review is complete, and any corresponding updates to the code have been made, you can merge the branch.
"},{"location":"guides/collaborating/peer-code-review/#retain-a-record-of-the-review","title":"Retain a record of the review","text":"By using merge/pull requests to review code, the discussion between the author and the reviewer is recorded. This can be a useful reference for future code reviews.
Tip
Try to record all of the discussion in the pull request comments, even if the author and reviewer meet in person, so that you have a complete record of the review.
"},{"location":"guides/collaborating/sharing-a-branch/","title":"Sharing a branch","text":"You might want a collaborator to work on a specific branch of your repository, so that you can keep their changes separate from your own work. Remember that you can merge commits from their branch into your own branches at any time.
Info
You need to ensure that your collaborator has access to the remote repository.
Create a new branch for the collaborator, and give it a descriptive name.
git checkout -b collab/jamie\n
In this example we created a branch called \"collab/jamie\", where \"collab\" is a prefix used to identify branches intended for collaborators, and the collaborator is called Jamie.
Remember that you can choose your own naming conventions.
Push this branch to your remote repository:
git push -u origin collab/jamie\n
Your collaborator can then make a local copy of this branch:
git clone --single-branch --branch collab/jamie repository-url\n
They can then create commits and push them to your remote repository with git push
.
The easiest way to share a repository with collaborators is to have a single remote repository that all collaborators can access. This repository could be located on a platform such as GitHub, GitLab, or Bitbucket, or on a platform provided by your University or Institute.
Theses platforms allow you to create public repositories and private repositories.
Everybody can view the contents of a public repository.
You control who can view the contents of a private repository.
For both types of repository, you control who can make changes to the repository, such as creating commits and branches.
Info
You should decide whether a public repository or a private repository suits you best.
"},{"location":"guides/collaborating/sharing-a-repository/#giving-collaborators-access-to-your-remote-repository","title":"Giving collaborators access to your remote repository","text":"The steps required to do this differ depending on which platform you are using. Here, we will describe how to give collaborators access to a repository on GitHub. For further details, see the GitHub documentation.
Open the main page of your GitHub repository.
Click on the \"Settings\" tab in the top navigation bar.
Click on the \"Collaborators\" item in the left sidebar.
Click on the \"Add people\" button.
Search for collaborators by entering their GitHub user name, their full name, or their email address.
Click the \"Add to this repository\" button.
This will send an invitation to the collaborator. If they accept this invitation, they will have access to your repository.
"},{"location":"guides/high-performance-computing/","title":"Cloud and HPC platforms","text":"This section introduces computing platforms that allow you to generate outputs more quickly, and without relying on your own laptop or desktop computer. It also demonstrates how to use version control to ensure that the code running on these platforms is the same as the code on your laptop.
"},{"location":"guides/project-structure/","title":"Project structure","text":"How we choose to structure a project can affect how readily someone else \u2014 or even yourself, after a period of absence \u2014 can understand, use, and extend the work.
Question
Have you ever looked at your old code and wondered how it worked or how to make it run?
Tip
A good project structure can serve as a table of contents and help the reader to navigate.
In an earlier section we provided some guidelines for how to structure a repository. In this section we present further guidelines and examples to help you choose a sensible structure for your current project and future projects.
This includes high-level recommendations that should apply to any project, and more detailed recommendations that may be specific to a particular type of project or choice of programming language.
"},{"location":"guides/project-structure/automating-tasks/","title":"Automate common tasks","text":"If you reach the point where you need to run a specific sequence of commands or actions to achieve something \u2014 e.g., running a model simulation, or producing an output figure \u2014 it is a very good idea to write a script that performs all of these actions correctly.
This is because while you may remember exactly what needs to be done right now, you may not remember next week, or next month, or next year. We're all human, and we all make mistakes, but these kinds of mistakes are easy to avoid!
Info
Mistakes are a fact of life. It is the response to the error that counts.
\u2014 Nikki Giovanni
There are many tools that can help you to automate tasks, some of which are smart enough that they will only do as little as possible (e.g., avoid re-running steps if the inputs have not changed).
There are popular tools aimed at specific programming languages, such as:
R: targets;
Python: nox and tox; and
Julia: pipelines.
There are many generic automation tools (see, e.g., Wikipedia's list of build automation software), although these can be rather complex to learn. We recommend using a language-specific automation tool where possible, and only using a generic automation tool as a last resort.
"},{"location":"guides/project-structure/exercise-a-good-readme/","title":"Exercise: a good README","text":"Remember that the README file (usually one of README.md
, README.rst
, or README.txt
) is often the very first thing that a user will see when looking at your project.
Have you seen any README files that were particularly helpful, or were not very helpful?
What information do you find helpful in a README file?
Consider the README.md
file in the Australian 2020 COVID-19 forecasts repository.
What content, if any, would you add to this file?
What content, if any, would you remove from this file?
Would you change its structure in any way?
Look back at your past projects and identify aspects of their structure that you have found helpful.
What features or choices have worked well in past projects and might help you structure your future projects?
What problems or issues have you experienced with the structure of your past projects, which you could avoid in your future projects?
Can any of your colleagues and collaborators share similar insights?
Once you've chosen a project structure, you need to write down how it all works \u2014 regardless of how simple and clear your project structure is!
Tip
The best place to do this is in a README.md
file (or equivalent) in the project root directory.
Begin with an overview of the project:
What question(s) are you trying to address?
What data, hypotheses, methods, etc, are you using?
What outputs does this generate?
You can then provide further detail, such as:
What software environment and/or packages must be available for your code to run?
How can the user generate each of the outputs?
What license have you chosen?
See the Australian 2020 COVID-19 forecasts repository for an example README.md
file.
This repository was used to generate the results, tables, and figures presented in the paper \"Forecasting COVID-19 activity in Australia to support pandemic response: May to October 2020\", Scientific Reports 13, 8763 (2023).
Strengths:
It includes installation and usage instructions;
It identifies the paper; and
It identifies the license under which the code is distributed.
Weaknesses:
It only explains some of the project structure.
It doesn't provide an overview of the project, it only links to the paper.
The root directory contains a number of scripts and input files that aren't described.
A good first step in deciding how to structure a project is to ask yourself:
What are the different project phases?
What are the major activities in each phase?
For example, a project might involve the following phases:
Clean an existing data set;
Build models with different hypotheses or features;
Fit each model to the data; and
Decide which model best explains the data.
The data-cleaning phase might involve the following activities:
Obtain the raw data;
Identify the quality checks that should be applied;
Decide how to resolve data that fail each quality check; and
Generate and record the cleaned data.
The model-building phase might involve the following activities:
Perform a literature search to identify relevant modelling studies;
Identify competing hypotheses or features that might explain the data;
Design a model that implements each hypothesis; and
Define the relationship between each model and the cleaned data.
You can use the phases and activities to guide your choice of directory structure. For this example project, one possible structure is:
project/
: the root directory of your project
input/
: a sub-directory that contains input data;
raw/
: the raw data before cleaning;
cleaned/
: the cleaned data;
code/
: a sub-directory that contains the project code;
cleaning/
: the data cleaning code;
model-first-hypothesis/
: the first model;
model-second-hypothesis/
: the second model;
fitting/
: the code that fits each model to the data;
evaluation/
: the code the compares the model fits;
plotting/
: the code that plots output figures;
paper/
: a sub-directory for the project manuscript;
figures/
: the output figures;This section demonstrates how to use version control and software testing to ensure that your research results can be independently reproduced by others.
Tip
Reproducibility is just as much about simple work habits as the tools used to share data and code.
\u2014 Jesse M.\u00a0Alston and Jessica A.\u00a0Rick
"},{"location":"guides/reproducibility/what-is-reproducible-research/","title":"What is reproducible research?","text":"Various scientific disciplines have defined and used the terms \"replication\" and \"reproducible\" in different (and sometimes contradictory) ways. But in recent years there have been efforts to standardise these terms, particularly in the context of computational research. Here we will use the definitions from Reproducibility and Replicability in Science:
Replication: obtaining consistent results across studies aimed at answering the same scientific question, each of which has obtained its own data.
Reproducible: obtaining consistent computational results using the same input data, computational steps, methods, code, and conditions of analysis.
Question
If you can't explain your research well enough for someone else to reproduce the results, can you really call what you did \"research\"?
Creating reproducible research requires good work habits and practices, and also a way to verify that the results are reproducible.
It is sometimes said that reproducibility can be achieved by sharing all of the code and data with a published article, but this is not necessarily sufficient for computationally complex studies. There are many reasons why running the same code with the same data may produce different results.
Tip
It is easier to make our research reproducible by starting at the planning stage, rather than waiting until we've produced results.
"},{"location":"guides/testing/","title":"Testing","text":"This section introduces the topic of software testing. Testing your code is an important part of any code-based research activity. Tests check whether your code behaves as intended, and can warn you if you introduce a bug or mistake into your code.
Tip
Tests can show the presence of bugs, but not their absence.
\u2014 Edsger W.\u00a0Dijkstra
"},{"location":"guides/using-git/","title":"Effective use of git","text":"This section shows how to use the git
command-line program to record your work, to inspect your commit history, and to search this commit history to identify commits that make specific changes or have specific effects.
Reminder
Remember to commit early and commit often. Do not wait until your code is \"perfect\".
"},{"location":"guides/using-git/choosing-a-license/","title":"Choosing a license","text":"A license specifies the conditions under which others may use, modify, and/or distribute your work.
Info
Simply making a repository publicly accessible is not sufficient to allow others to make use of your work. Unless you include a license that specifies otherwise, nobody else can copy, distribute, or modify your work.
There are many different types of licenses that you can use, and the number of options can seem overwhelming. But it is usually straightforward to narrow down your options.
If you're working on an existing project, the easiest option is to use that project's license.
If you're working with an existing community, they may have a preferred license.
If you want to choose an open source license, the Choose an open source license website provides advice for selecting a license that meets your needs.
For further information about the various types of available licenses, and some advice for selecting a suitable license for academic software, see A Quick Guide to Software Licensing for the Scientist-Programmer.
"},{"location":"guides/using-git/choosing-your-git-editor/","title":"Choosing your Git editor","text":"In this video, we show how to use nano and vim for writing commit messages. See below for brief instructions on how to use these editors.
Tip
This editor is only used for writing commit messages. It is entirely separate from your choice of editor for any other task, such as writing code.
Git editor exampleVideo timeline:
Note
You can pause the video to select and copy any of the text, such as the git config --global core.editor
commands.
Once you have written your commit message, press Ctrl + O and then Enter to save the commit message, then press Ctrl + X to quit the editor.
To quit without saving press Ctrl + X. If you have made any changes, nano
will ask if you want to save them. Press n to quit without saving these changes.
You need to press i (switch to insert mode) before you can write your commit message. Once you have written your commit message, press Esc and then type :wq to save your changes and quit the editor.
To quit without saving press Esc and then type :q!.
"},{"location":"guides/using-git/cloning-an-existing-repository/","title":"Cloning an existing repository","text":"If there is an existing repository that you want to work on, you can \"clone\" the repository and have a local copy. To do this, you need to know the remote repository's URL.
Tip
For GitHub repositories, there should be a green button labelled \"Code\". Click on this button, and it will provide you with the URL.
You can then make a local copy of the repository by running:
git clone URL\n
For example, to make a local copy of this book, run the following command:
git clone https://github.com/robmoss/git-is-my-lab-book.git\n
This will create a local copy in the directory git-is-my-lab-book
.
Note
If you have a GitHub account and have set up an SSH key, you can clone GitHub repositories using your SSH key. This will allow you to push commits to the remote repository (if you are permitted to do so) without having to enter your user name and password.
You can obtain the SSH URL from GitHub by clicking on the green \"Code\" button, and selecting the \"SSH\" tab.
For example, to make a local copy of this book using SSH, run the following command:
git clone git@github.com:robmoss/git-is-my-lab-book.git\n
"},{"location":"guides/using-git/creating-a-commit/","title":"Creating a commit","text":"Creating a commit involves two steps:
Identify the changes that should be included in the commit. These changes are then \"staged\" and ready to be included in the next commit.
Create a new commit that records these staged changes. This should be accompanied by a useful commit message.
We will now show how to perform these steps.
Note
At any time, you can see a summary of the changes in your repository, and which ones are staged to be committed, by running:
git status\n
This will show you:
If you've created a new file, you can include this file in the next commit by running:
git add filename\n
"},{"location":"guides/using-git/creating-a-commit/#adding-all-changes-in-an-existing-file","title":"Adding all changes in an existing file","text":"If you've made changes to an existing file, you can include all of these changes in the next commit by running:
git add filename\n
"},{"location":"guides/using-git/creating-a-commit/#adding-some-changes-in-an-existing-file","title":"Adding some changes in an existing file","text":"If you've made changes to an existing file and only want to include some of these changes in the next commit, you can select the changes to include by running:
git add -p filename\n
This will show you each of the changes in turn, and allow you select which ones to stage.
Tip
This interactive selection mode is very flexible; you can enter ?
at any of the prompts to see the range of available actions.
If you want to rename a file, you can use git mv
to rename the file and stage this change for inclusion in the next commit:
git mv filename newname\n
"},{"location":"guides/using-git/creating-a-commit/#removing-a-file","title":"Removing a file","text":"If you want to remove a file, you can use git rm
to remove the file and stage this change for inclusion in the next commit:
git rm filename\n
Tip
If the file has any uncommitted changes, git will refuse to remove the file. You can override this behaviour by running:
git rm --force filename\n
"},{"location":"guides/using-git/creating-a-commit/#inspecting-the-staged-changes","title":"Inspecting the staged changes","text":"To verify that you have staged all of the desired changes, you can view the staged changes by running:
git diff --cached\n
You can view the staged changes for a specific file by running:
git diff --cached filename\n
"},{"location":"guides/using-git/creating-a-commit/#undoing-a-staged-change","title":"Undoing a staged change","text":"You may sometimes stage a change for inclusion in the next commit, but decide later that you don't want to include it in the next commit. You can undo staged changes to a file by running:
git restore --staged filename\n
Note
This will not modify the contents of the file.
"},{"location":"guides/using-git/creating-a-commit/#creating-a-new-commit","title":"Creating a new commit","text":"Once you have staged all of the changes that you want to include in the commit, create the commit by running:
git commit\n
This will open your chosen editor and prompt you to write the commit message.
Tip
Note that the commit will not be created until you exit the editor.
If you decide that you don't want to create the commit, you can abort this action by closing your editor without saving a commit message.
Please see Choosing your Git editor for details.
"},{"location":"guides/using-git/creating-a-commit/#modifying-the-most-recent-commit","title":"Modifying the most recent commit","text":"After you create a commit, you might decide that there are other changes that should be included in the commit. Git provides a simple way of modifying the most recent commit.
Warning
Do not modify the commit if you have already pushed it to another repository. Instead, record a new commit that includes the desired changes.
Remember that your commit history should not be a highly-edited, polished view of your work, but should instead act as a lab book.
Do not worry about creating \"perfect\" commits!
To modify the most recent commit, stage the changes that you want to commit (see the sections above) and add them to the most recent commit by running:
git commit --amend\n
This will open your chosen editor and allow you to modify the commit message.
"},{"location":"guides/using-git/creating-a-remote-repository/","title":"Creating a remote repository","text":"Once you have created a \"local\" repository (i.e., a repository that exists on your own computer), it is generally a good idea to create a \"remote\" repository. You may choose to store this remote repository on a service such as GitHub, or on a University-provided platform.
If you are using GitHub, you can choose to create a public repository (viewable by anyone, but you control who can make changes) or a private repository (you control who can view and/or make changes).
"},{"location":"guides/using-git/creating-a-remote-repository/#linking-your-local-and-remote-repositories","title":"Linking your local and remote repositories","text":"Once you have created the remote repository, you need to link it to your local repository. This will allow you to \"push\" commits from your local repository to the remote repository, and to \"pull\" commits from the remote repository to your local repository.
Note
When you create a new repository on services such as GitHub, they will give you instructions on how to link this new repository to your local repository. We also provide an example, below.
A repository can be linked to more than one remote repository, so we need to choose a name to identify this remote repository.
Info
The name \"origin\" is commonly used to identify the main remote repository.
In this example, we link our local repository to the remote repository for this book (https://github.com/robmoss/git-is-my-lab-book
) with the following command:
git remote add origin git@github.com:robmoss/git-is-my-lab-book.git\n
Note
Notice that the URL is similar to, but not identical to, the URL you use to view the repository in your web browser.
"},{"location":"guides/using-git/creating-a-repository/","title":"Creating a repository","text":"You can create repositories by running git init
. This will create a .git
directory that will contain all of the repository information.
There are two common ways to use git init
:
Create an empty repository in the current directory, by running:
git init\n
Create an empty repository in a specific directory, by running:
git init path/to/repository\n
Info
Git will create the repository directory if it does not exist.
In this exercise you will create a local repository, and use this repository to create multiple commits, switch between branches, and inspect the repository history.
Create a new, empty repository in a directory called git-exercise
.
Create a README.md
file and write a brief description for this repository. Record the contents of README.md
in a new commit, and write a commit message.
Write a script that generates a small data set, and saves the data to a CSV file. For example, this script could sample values from a probability distribution with fixed shape parameters. Explain how to use this script in README.md
. Record your changes in a new commit.
Write a script that plots these data, and saves the figure in a suitable file format. Explain how to use this script in README.md
. Record your changes in a new commit.
Add a tag milestone-1
to the commit you created in the previous step.
Create a new branch called feature/new-data
. Check out this branch and modify the data-generation script so that it produces new data and/or more data. Record your changes in one or more new commits.
Create a new branch called feature/summarise
from the tag you created in step #5. Check out this branch and modify the plotting script so that it also prints some summary statistics of the data. Record your changes in one or more new commits.
In your main
or master
branch, and add a license. Record your changes in a new commit.
In your main
or master
branch, merge the two feature branches created in steps #6 and #7, and add a new tag milestone-2
.
Now that you have started a repository, created commits in multiple branches, and merged these branches, here are some questions for you to consider:
Have you committed the generated data file and/or the plot figure?
If you haven't committed either or both of these files, have you instructed git
to ignore them?
Did you add a meaningful description to each milestone tag?
How many commits modified your data-generation script?
How many commits modified your plotting script?
What changes, if any, were made to README.md
since it was first created?
Tip
To answer some of these questions, you may need to run git
commands.
We have created a public repository that you can use to try resolving a merge conflict yourself. This repository includes some example data and a script that performs some basic data analysis.
First, obtain a local copy (a \"clone\") of this repository by running:
git clone https://github.com/robmoss/gimlb-simple-merge-example.git\ncd gimlb-simple-merge-example\n
"},{"location":"guides/using-git/exercise-resolve-a-merge-conflict/#the-repository-history","title":"The repository history","text":"You can inspect the repository history by running git log
. Some key details to notice are:
README.md
LICENSE
analysis/initial_exploration.R
input_data/data.csv
The second commit created the following file:
outputs/summary.csv
This commit has been given the tag first_milestone
.
From this first_milestone
tag, two branches were created:
The feature/second-data-set
branch adds a second data set and updates the analysis script to inspect both data sets.
The feature/calculate-rate-of-change
branch changes which summary statistics are calculated for the original data set.
The example-solution
branch merges both feature branches and resolves any merge conflicts. This branch has been given the tag second_milestone
.
You will start with the master
branch, which contains the commits up to the first_milestone
tag, and then merge the two feature branches into this branch, resolving any merge conflicts that arise. You can then compare your results to the example-solution
branch.
Obtain a local copy of this repository, by running:
git clone https://github.com/robmoss/gimlb-simple-merge-example.git\ncd gimlb-simple-merge-example\n
Create local copies of the two feature branches and the example solution, by running:
git checkout feature/second-data-set\ngit checkout feature/calculate-rate-of-change\ngit checkout example-solution\n
Return to the master
branch, by running:
git checkout master\n
Merge the feature/second-data-set
branch into master
, by running:
git merge feature/second-data-set\n
Merge the feature/calculate-rate-of-change
branch into master
, by running:
git merge feature/calculate-rate-of-change\n
This will result in a merge conflict, and now you need to decide how to resolve each conflict! Once you have resolved the conflicts, create a commit that records all of your changes (see the previous chapter for an example).
Tip
You may find it helpful to inspect the commits in each of the feature branches to understand how they have changed the files in which the conflicts have occurred.
"},{"location":"guides/using-git/exercise-resolve-a-merge-conflict/#self-evaluation","title":"Self evaluation","text":"Once you have created a commit that resolves these conflicts, see how similar or different the contents of your commit are to the corresponding commit in the example-solution
branch (which has been tagged second_milestone
). You can inspect this commit by running:
git show example-solution\n
You can compare this commit to your solution by running:
git diff example-solution\n
How does your resolution compare to this commit?
Note
You may have resolved the conflicts differently to the example-solution
branch, and that's perfectly fine as long as they have the same effect.
Here we present a recorded terminal session in which we clone this repository and resolve the merge conflict.
Tip
You can use the video timeline (below) to jump to specific moments in this exercise. Remember that you can pause the recording at any point to select and copy any of the text.
Resolving a merge conflictVideo timeline:
feature/second-data-set
branchfeature/calculate-rate-of-change
branchfeature/second-data-set
branchfeature/calculate-rate-of-change
branchIn this exercise, you will use a remote repository to synchronise and merge changes between multiple local repositories, starting from the local git-exercise
repository that you created in the previous exercise.
Create a new remote repository on a platform such as GitHub. You can make this a private repository, because you won't need to share it with anyone.
Link your local git-exercise
repository to this remote repository, and push all branches and tags to this remote repository.
Make a local copy of this remote repository called git-exercise-2
.
Check out the main
or master
branch. The files should be identical to the milestone-2
tag in your original git-exercise
repository.
Create a new branch called feature/report
. Check out this branch and create a new file called report.md
. Edit this file so that it contains:
A brief description of the generated data set;
Record your changes in a new commit.
In your original git-exercise
repository, checkout the feature/report
branch from the remote repository and verify that it now contains the file report.md
.
Merge this branch into your main
or master
branch, and add a new tag milestone-3-report
.
Push the updated main
or master
branch to the remote repository.
git-exercise-2
repository, checkout the main
or master
branch and pull changes from the remote repository. It should now contain the file report.md
.Info
Congratulations! You have used a remote repository to synchronise and merge changes between two local repositories. You can use this workflow to collaborate with colleagues.
"},{"location":"guides/using-git/exercise-use-a-remote-repository/#self-evaluation","title":"Self evaluation","text":"Now that you have used commits and branches to share work between multiple repositories, here are some questions for you to consider:
Do you feel comfortable in deciding which changes to record in a single commit?
Do you feel that your commit messages help describe the changes that you have made in this repository?
Do you feel comfortable in using multiple branches to work on separate ideas in parallel?
Do you have any current projects that you might want to work on using local and remote git
repositories?
Once you've installed Git, you should define some important settings before you starting using Git.
Info
We assume that you will want to set the git configuration for all repositories owned by your user. Therefore, we use the --global
flag. Configuration files can be set for a single repository or the whole computer by replacing --global
with --local
or --system
respectively.
Define your user name and email address. These details are included in every commit that you create.
git config --global user.name \"My Name\"\ngit config --global user.email \"my.name@some-university.edu.au\"\n
Define the text editor that Git should use for tasks such as writing commit messages: git config --global core.editor editor-name\n
NOTE: on Windows you need to specify the full path to the editor:
git config --global core.editor \"C:/Program Files/My Editor/editor.exe\"\n
Tip
Please see Choosing your Git editor for details.
By default, Git will create a branch called master
when you create a new repository. You can set a different name for this initial branch:
git config --global init.defaultBranch main\n
Ensure that repository histories always record when branches were merged:
git config --global merge.ff no\n
This prevents Git from \"fast-forwarding\" when the destination branch contains no new commits. For example, it ensures that when you merge the green branch into the blue branch (as shown below) it records that commits D, E, and F came from the green branch.
Adjust how Git shows merge conflicts:
git config --global merge.conflictstyle diff3\n
This will be useful when we look at how to use branches and how to resolve merge conflicts.
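At any point you can review the settings you have defined so far by running:
git config --global --list\n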
Info
If you use Windows, there are tools that can improve your Git experience in PowerShell.
There are also tools for integrating Git into many common text editors. See Git in other environments, Appendix A of the Pro Git book.
"},{"location":"guides/using-git/graphical-git-clients/","title":"Graphical Git clients","text":"In this book we will primarily show how to use Git from the command-line. If you don't have Git already installed on your computer, see these instructions for installing Git.
In addition to using the command-line, there are other ways to work with Git repositories:
There are many graphical clients that you can download and use;
Many editors include built-in Git support (e.g., Atom, RStudio, Visual Studio Code); and
Online platforms such as GitHub, GitLab, and Bitbucket also provide a graphical interface for common Git actions.
All of the concepts and terminology you will learn in this book should also apply to all of the tools listed above.
"},{"location":"guides/using-git/how-to-create-and-use-tags/","title":"How to create and use tags","text":"Tags allow you to bookmark important points in your commit history.
You can use tags to identify milestones such as:
completed model features (for example, feature-age-dependent-mixing); particular project objectives (for example, objective-1, objective-2); manuscript drafts (for example, draft-1, draft-2); and manuscript submissions and revisions (for example, submitted, revision-1).
You can add a tag (in this example, \"my-tag\") to the current commit by running:
git tag -a my-tag\n
This will open your chosen editor and ask you to write a description for this tag.
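If you prefer to provide the description directly on the command line, rather than in your editor, you can use the -m option (the message here is just an example):
git tag -a my-tag -m \"First submitted version of the manuscript\"\n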
"},{"location":"guides/using-git/how-to-create-and-use-tags/#pushing-tags-to-a-remote-repository","title":"Pushing tags to a remote repository","text":"By default, git push
doesn't push tags to remote repositories. Instead, you have to explicitly push tags. You can push a tag (in this example, called \"my-tag\") to a remote repository (in this example, called \"origin\") by running:
git push origin my-tag\n
You can push all of your tags to a remote repository (in this example, called \"origin\") by running:
git push origin --tags\n
"},{"location":"guides/using-git/how-to-create-and-use-tags/#tagging-a-past-commit","title":"Tagging a past commit","text":"To add a tag to a previous commit, you can identify the commit by its hash. For example, you can inspect your commit history by running:
git log --oneline --no-decorate\n
If your commit history looks like:
003cf6b Show how to ignore certain files\n339eb5a Show how to prepare and record commits\n6a7fb8b Show how to clone remote repositories\n...\n
where the current commit is 003cf6b
(\"Show how to ignore certain files\"), you can tag the previous commit (\"Show how to prepare and record commits\") by running: git tag -a my-tag 339eb5a\n
"},{"location":"guides/using-git/how-to-create-and-use-tags/#listing-tags","title":"Listing tags","text":"You can list all tags by running:
git tag\n
You can also list only tags that match a specific pattern (in this example, all tags beginning with \"my\") by running:
git tag --list 'my*'\n
"},{"location":"guides/using-git/how-to-create-and-use-tags/#deleting-tags","title":"Deleting tags","text":"You can delete a tag by running:
git tag --delete my-tag\n
"},{"location":"guides/using-git/how-to-create-and-use-tags/#creating-a-branch-from-a-tag","title":"Creating a branch from a tag","text":"You can check out a tag and begin working on a new branch by running:
git checkout -b my-branch my-tag\n
"},{"location":"guides/using-git/how-to-ignore-certain-files/","title":"How to ignore certain files","text":"Your repository may contain files that you don't want to include in your commit history. For example, you may not want to include files of the following types:
.aux
files, which are generated when compiling LaTeX documents; and.pyc
files, which are generated when running Python code..pdf
versions of LaTeX documents.
file. This is a plain text file, where each line defines a pattern that identifies files and directories which should be ignored. You can also add comments, which must start with a #
, to explain the purpose of these patterns.
Tip
If your editor will not accept .gitignore
as a file name, you can create a .gitignore
file in your repository by running:
touch .gitignore\n
For example, the following .gitignore
file would make Git ignore all .aux
and .pyc
files, and the file my-paper.pdf
:
# Ignore all .aux files generated by LaTeX.\n*.aux\n# Ignore all byte-code files generated by Python.\n*.pyc\n# Ignore the PDF version of my paper.\nmy-paper.pdf\n
If you have sensitive data files, one option is to store them all in a dedicated directory and add this directory to your .gitignore
file:
# Ignore all data files in the \"sensitive-data\" directory.\nsensitive-data\n
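To check whether (and why) Git is ignoring a particular file, you can ask it to report the matching pattern (the file name here is just an example):
git check-ignore -v sensitive-data/cases.csv\n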
Tip
You can force Git to add an ignored file to a commit by running:
git add --force my-paper.pdf\n
But it would generally be better to update your .gitignore
file so that it stops ignoring these files.
A merge conflict can occur when we try to merge one branch into another, if the two branches introduce any conflicting changes.
For example, consider trying to merge two branches that make the following changes to the same line of the file test.txt
:
On the branch my-new-branch
:
First line\n-Second line\n+My new second line\n Third line\n
On the main branch:
First line\n-Second line\n+A different second line\n Third line\n
When we attempt to merge my-new-branch
into the main branch, git merge my-new-branch
will tell us:
Auto-merging test.txt\nCONFLICT (content): Merge conflict in test.txt\nAutomatic merge failed; fix conflicts and then commit the result.\n
The test.txt
file will now include the conflicting changes, which we can inspect with git diff
:
diff --cc test.txt\nindex 18712c4,bc576a6..0000000\n--- a/test.txt\n+++ b/test.txt\n@@@ -1,3 -1,3 +1,7 @@@\n First line\n++<<<<<<< ours\n +A different second line\n++=======\n+ My new second line\n++>>>>>>> theirs\n Third line\n
Note that this two-day diff shows:
Each conflict is surrounded by <<<<<<<
and >>>>>>>
markers, and the conflicting changes are separated by a =======
marker.
If we instruct Git to use a three-way diff (see first-time Git setup), the conflict will be reported slightly differently:
diff --cc test.txt\nindex 18712c4,bc576a6..0000000\n--- a/test.txt\n+++ b/test.txt\n@@@ -1,3 -1,3 +1,7 @@@\n First line\n++<<<<<<< ours\n +A different second line\n++||||||| base\n++Second line\n++=======\n+ My new second line\n++>>>>>>> theirs\n Third line\n
In addition to showing \"our\" changes and \"their changes\", this three-way diff also shows the original lines, between the |||||||
and =======
markers. This extra information can help you decide how to best resolve the conflict.
We can edit test.txt
to reconcile these changes, and the commit our fix. For example, we might decide that test.txt
should have the following contents:
First line\nThe corrected second line\nThird line\n
We can then commit these changes to resolve the merge conflict:
git add test.txt\ngit commit -m \"Resolved the merge conflict\"\n
"},{"location":"guides/using-git/how-to-resolve-merge-conflicts/#cancelling-the-merge","title":"Cancelling the merge","text":"Alternatively, you may decide you don't want to merge these two branches, in which case you cancel the merge by running:
git merge --abort\n
"},{"location":"guides/using-git/how-to-structure-a-repository/","title":"How to structure a repository","text":"While there is no single \"best\" way to structure a repository, there are some guidelines that you can follow. The key aims are to ensure that your files are logically organised, and that others can easily navigate the repository.
"},{"location":"guides/using-git/how-to-structure-a-repository/#divide-your-repository-into-multiple-directories","title":"Divide your repository into multiple directories","text":"It is generally a good idea to have separate directories for different types of files. For example, your repository might contain any of these different file types, and you should at least consider storing each of them in a separate directory:
Choosing file names that indicate what each file/directory contains can help other people, such as your collaborators, navigate your repository. They can also help you when you return to a project after several weeks or months.
Tip
Have you ever asked yourself \"where is the file that contains X\"?
Use descriptive file names, and the answer might be right in front of you!
"},{"location":"guides/using-git/how-to-structure-a-repository/#include-a-readme-file","title":"Include aREADME
file","text":"You can write this in Markdown (README.md
), in plain text (README
or README.txt
), or in any other suitable format. For example, Python projects often use reStructuredText and have a README.rst
file.
This file should begin with a brief description of why the repository was created and what it contains.
Importantly, this file should also mention:
How the files and directories are arranged. Help your collaborators understand where they need to look in order to find something.
How to run important pieces of code (e.g., to generate output data files or figures).
The software packages and/or libraries that are required to run any of the code in this repository.
The license (if any) under which the repository contents are being made available.
Recall that branches allow you to work on different ideas or tasks in parallel, within a single repository. In this chapter, we will show you how create and use branches. In the Collaborating section, we will show you how branches can allow multiple people to work together on code and papers, and how you can use branches for peer code review.
Info
Branches, like tags, are identified by name. Common naming conventions include:
feature/some-new-thing
for adding something new (a new data analysis, a new model feature, etc); andbugfix/some-problem
for fixing something that isn't working as intended (e.g., perhaps there's a mistake in a data analysis script).You can choose your own conventions, but make sure that you choose meaningful names.
Do not use names like branch1
, branch2
, etc.
You can create a new branch (in this example, called \"my-new-branch\") that starts from the current commit by running:
git checkout -b my-new-branch\n
You can also create a new branch that starts from a specific commit, tag, or branch in your repository:
git checkout -b my-new-branch 95eaae5 # From an existing commit\ngit checkout -b my-new-branch my-tag-name # From an existing tag\ngit checkout -b my-new-branch my-other-branch # From an existing branch\n
You can then create a corresponding upstream branch in your remote repository (in this example, called \"origin\") by running:
git push -u origin my-new-branch\n
"},{"location":"guides/using-git/how-to-use-branches/#working-on-a-remote-branch","title":"Working on a remote branch","text":"If there is a branch in your remote repository that you want to work on, you can make a local copy by running:
git checkout remote-branch-name\n
This will create a local branch with the same name (in this example, \"remote-branch-name\").
"},{"location":"guides/using-git/how-to-use-branches/#listing-branches","title":"Listing branches","text":"You can list all of the branches in your repository by running:
git branch\n
This will also highlight the current branch.
"},{"location":"guides/using-git/how-to-use-branches/#switching-between-branches","title":"Switching between branches","text":"You can switch from your current branch to another branch (in this example, called \"other-branch\") by running:
git checkout other-branch\n
Info
Git will not let you switch branches if you have any uncommitted changes.
One way to avoid this issue is to record the current changes as a new commit, and explain in the commit message that this is a snapshot of work in progress.
A second option is to discard the uncommitted changes to each file by running:
git restore file1 file2 file3 ... fileN\n
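A third option is to temporarily set your uncommitted changes aside with the stash, and restore them later:
git stash\ngit checkout other-branch\n# ... later, back on the original branch ...\ngit stash pop\n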
"},{"location":"guides/using-git/how-to-use-branches/#pushing-and-pulling-commits","title":"Pushing and pulling commits","text":"Once you have created a branch, you can use git push
to \"push\" your commits to the remote repository, and git pull
to \"pull\" commits from the remote repository. See Pushing and pulling commits for details.
You can use git log
to inspect the commit history of any branch:
git log branch-name\n
Remember that there are many ways to control what git log
will show you.
Similarly, you can use git diff
to compare the changes in any two branches:
git diff first-branch second-branch\n
Again, there are ways to control what git diff
will show you.
You may reach a point where you want to incorporate the changes from one branch into another branch. This is referred to as \"merging\" one branch into another, and is illustrated in the What is a branch? chapter.
For example, you might have completed a new feature for your model or data analysis, and now want to merge this back into your main branch.
First, ensure that the current branch is the branch you want to merge the changes into (this is often your main or master branch). You can them merge the changes from another branch (in this example, called \"other-branch\") by running:
git merge other-branch\n
This can have two different results:
The commits from other-branch
were merged successfully into the current branch; or
There were conflicting changes (referred to as a \"merge conflict\").
In the next chapter we will show you how to resolve merge conflicts.
"},{"location":"guides/using-git/inspecting-your-history/","title":"Inspecting your history","text":"You can inspect your commit history at any time with the git log
command. By default, this command will list every commit from the very first commit to the current commit, and for each commit it will show you:
There are many ways to adjust which commits and what details that git log
will show.
Tip
Each commit has a unique identifier (\"hash\"). These hashes are quite long, but in general you only need to provide the first 5-7 digits to uniquely identify a specific commit.
"},{"location":"guides/using-git/inspecting-your-history/#listing-commits-over-a-specific-time-interval","title":"Listing commits over a specific time interval","text":"You can limit which commits git log
will show by specifying a start time and/or an end time.
Tip
This can be extremely useful for generating progress reports and summarising your recent activity in team meetings.
For example, you can view commits from the past week by running:
git log --since='7 days'\ngit log --since='1 week'\n
You can view commits made between 1 and 2 weeks ago by running:
git log --since='2 weeks' --until='1 week'\n
You can view commits made between specific dates by running:
git log --since='2022/05/12' --until='2022/05/14'\n
"},{"location":"guides/using-git/inspecting-your-history/#listing-commits-that-modify-a-specific-file","title":"Listing commits that modify a specific file","text":"You can see which commits have made changes to a file by running:
git log -- filename\n
Info
Note the --
argument that comes before the file name. This ensures that if the file name begins with a -
, git log
will not treat the file name as an option.
You can make git log
display only the first 7 digits of each commit hash, and the first line of each commit message, by running:
git log --oneline\n
This can be a useful way to get a quick overview of the recent history.
"},{"location":"guides/using-git/inspecting-your-history/#viewing-the-contents-of-a-single-commit","title":"Viewing the contents of a single commit","text":"You can identify a commit by its unique identifier (\"hash\") or by its tag name (if it has been tagged), and view the commit with git show
:
git show commit-hash\ngit show tag-name\n
This will show the commit details and all of the changes that were recorded in this commit.
Tip
By default, git show
will show you the most recent commit.
You can view all of the changes that were made between two commits with the git diff
command.
Tip
The git diff
command shows the difference between two points in your commit history.
Note that git diff
does not support start and/or end times like git log
does; you must use commit identifiers.
For example, here is a subset of the commit history for this book's repository:
95eaae5 Note the need for a GitHub account and SSH key\n11085f0 Show how to create a branch from a tag\n9369482 Show how to create and use tags\n003cf6b Show how to ignore certain files\n339eb5a Show how to prepare and record commits\n6a7fb8b Show how to clone remote repositories\n6a49e10 Note that mdbook-admonish must be installed\na8e6114 Fixed the URL for the UoM GitLab instance\n5192704 Add a merge conflict exercise\n
We can view all of the changes that were made after the bottom commit (5192704
, \"Add a merge conflict exercise\") up to and including the top commit (95eaae5
, \"Note the need for a GitHub account and SSH key\") by running:
git diff 5192704..95eaae5\n
In the above example, 8 files were changed, with a total of 310 new lines and 7 deleted lines. This is a lot of information! You can print a summary of these changes by running:
git diff --stat 5192704..95eaae5\n
This should show you the following details:
README.md | 2 +-\n src/SUMMARY.md | 3 +\n src/prerequisites.md | 2 +\n src/using-git/cloning-an-existing-repository.md | 36 ++++++++++\n src/using-git/creating-a-commit.md | 146 +++++++++++++++++++++++++++++++++++++--\n src/using-git/how-to-create-and-use-tags.md | 89 ++++++++++++++++++++++++\n src/using-git/how-to-ignore-certain-files.md | 37 ++++++++++\n src/version-control/what-is-a-repository.md | 2 +-\n 8 files changed, 310 insertions(+), 7 deletions(-)\n
This reveals that about half of the changes (146 new/deleted lines) were made to src/using-git/creating-a-commit.md
.
Similar to the git log
command, you can limit the files that the git diff
command will examine. For example, you can display only the changes made to README.md
in the above example by running:
git diff 5192704..95eaae5 -- README.md\n
This should show you the following change:
diff --git a/README.md b/README.md\nindex 7956b65..a34f907 100644\n--- a/README.md\n+++ b/README.md\n@@ -15,7 +15,7 @@ This work is licensed under a [Creative Commons Attribution-ShareAlike 4.0 Inter\n\n ## Building the book\n\n-You can build this book by installing [mdBook](https://rust-lang.github.io/mdBook/) and running the following command in this directory:\n+You can build this book by installing [mdBook](https://rust-lang.github.io/mdBook/) and [mdbook-admonish](https://github.com/tommilligan/mdbook-admonish/), and running the following command in this directory:\n\n ```shell\n mdbook build\n
"},{"location":"guides/using-git/pushing-and-pulling-commits/","title":"Pushing and pulling commits","text":"In general, we \"push\" commits from our local repository to a remote repository by running:
git push <remote-repository>\n
and \"pull\" commits from a remote repository into our local repository by running:
git pull <remote-repository>\n
where <remote-repository>
is either a URL or the name of a remote repository.
However, we generally want to push to, and pull from, the same remote repository every time. See the next section for an example of linking the main branch in your local repository with a corresponding \"upstream\" branch in your remote repository.
"},{"location":"guides/using-git/pushing-and-pulling-commits/#pushing-your-first-commit-to-a-remote-repository","title":"Pushing your first commit to a remote repository","text":"In order to push commits from your local repository to a remote repository, we need to create a branch in the remote repository that corresponds to the main branch of our local repository. This requires that you have created at least one commit in your local repository.
Tip
This is a good time to create a README.md
file and write a brief description of what this repository will contain.
Once you have at least one commit in your local repository, you can create a corresponding upstream branch in the remote repository with the following command:
git push -u origin <branch-name>\n
The default branch will probably be called \"main\"
or \"master\"
, depending on your Git settings. You can identify the branch name by running:
git branch\n
Note
Recall that we identify remote repositories by name. In this example, the remote repository is call \"origin\". You can choose a different name when linking your local and remote repositories.
Once you have defined the upstream branch, you can push commits by running:
git push\n
and pull commits by running:
git pull\n
without having to specify the remote repository or branch name.
"},{"location":"guides/using-git/pushing-and-pulling-commits/#forcing-updates-to-a-remote-repository","title":"Forcing updates to a remote repository","text":"By default, Git will refuse to push commits from a local branch to a remote branch if the remote branch contains any commits that are not in your local branch. This situation should not arise in general, and typically indicates that either someone else has pushed new commits to the remote branch (see the Collaborating section) or that you have altered the history of your local branch.
If you are absolutely confident that your local history of commits should replace the contents of the remote branch, you can force this update by running:
git push --force\n
Tip
Unless you are confident that you understand why this situation has occurred, it is probably a good idea to ask for advice before running the above command.
"},{"location":"guides/using-git/where-did-this-line-come-from/","title":"Where did this line come from?","text":"Consider the What should I commit? chapter. Imagine that we want to know when and why the following text was added:
A helpful guideline is \"**commit early, commit often**\".\n
If we can identify the relevant commit, we can then inspect the commit (using git show <commit>
) to see all of the changes that it introduced. Ideally, the commit message will explain the reasons why this commit was made. This is one way in which your commit messages can act as a lab book.
At the time of writing (commit 2a96324
), the contents of the What should I commit? came from two commits:
git log --oneline src/version-control/what-should-I-commit.md\n
3dfff1f Add notes about committing early and often\n9be780b Briefly describe key version control concepts\n
We can use the git blame
command to identify the commit that last modified each line in this file:
git blame -s src/version-control/what-should-I-commit.md\n
9be780b8 1) # What should I commit?\n9be780b8 2)\n9be780b8 3) A commit should represent a **unit of work**.\n9be780b8 4)\n9be780b8 5) If you've made changes that represent multiple units of work (e.g., changing how input data are processed, and adding a new model parameter) these should be saved as separate commits.\n9be780b8 6)\n9be780b8 7) Try describing out loud the changes you have made, and if you find yourself saying something like \"I did X and Y and Z\", then the changes should probably divided into multiple commits.\n3dfff1fe 8)\n3dfff1fe 9) A helpful guideline is \"**commit early, commit often**\".\n3dfff1fe 10)\n3dfff1fe 11) ## Commit early\n3dfff1fe 12)\n3dfff1fe 13) - Don't delay creating a commit because \"it's not ready yet\".\n3dfff1fe 14)\n3dfff1fe 15) - A commit doesn't have to be \"perfect\".\n3dfff1fe 16)\n3dfff1fe 17) ## Commit often\n3dfff1fe 18)\n3dfff1fe 19) - Small, focused commits are **extremely helpful** when trying to identify the cause of an unintended change in your code's behaviour or output.\n3dfff1fe 20)\n3dfff1fe 21) - There is no such thing as too many commits.\n
You can see that the first seven lines were last modified by commit 9be780b
(Briefly describe key version control concepts), while the rest of the file was last modified by commit 3dfff1f
(Add notes about committing early and often). So the text that we're interested in (line 9) was introduced by commit 3dfff1f
.
You can inspect this commit by running the following command:
git show 3dfff1f\n
Video demonstration "},{"location":"guides/using-git/where-did-this-problem-come-from/","title":"Where did this problem come from?","text":"Let's find the commit that created the file src/version-control/what-is-a-repository.md
. We could find this out using git log
, but the point here is to illustrate how to use a script to find the commit that causes any arbitrary change to our repository.
Once the commit has been found, you can inspect it (using git show <commit>
) to see all of the changes this commit introduced and the commit message that (hopefully) explains the reasons why this commit was made. This is one way in which your commit messages can act as a lab book.
Create a Python script called my_test.py
with the following contents:
#!/usr/bin/env python3\nfrom pathlib import Path\nimport sys\n\nexpected_file = Path('src') / 'version-control' / 'what-is-a-repository.md'\n\nif expected_file.exists():\n # This file is the \"new\" thing that we want to detect.\n sys.exit(1)\nelse:\n # The file does not exist, this commit is \"old\".\n sys.exit(0)\n
For reference, here is an equivalent R script:
#!/usr/bin/Rscript --vanilla\n\nexpected_file <- file.path('src', 'version-control', 'what-is-a-repository.md')\n\nif (file.exists(expected_file)) {\n # This file is the \"new\" thing that we want to detect.\n quit(status = 1)\n} else {\n # The file does not exist, this commit is \"old\".\n quit(status = 0)\n}\n
Select the commit range over which to search. We know that the file exists in the commit 3dfff1f
(Add notes about committing early and often), and it did not exist in the very first commit (5a19b02
).
Instruct Git to start searching with the following command:
git bisect start 3dfff1f 5a19b02\n
Note that we specify the newer commit first, and then the older commit.
Git will inform you about the search progress, and which commit is currently being investigated.
Bisecting: 7 revisions left to test after this (roughly 3 steps)\n[92f1375db21dd8a35ca141365a477b963dbbf6dc] Add CC-BY-SA license text and badge\n
Instruct Git to use the script my_test.py
to check each commit with the following command:
git bisect run ./my_test.py\n
It will continue to report the search progress and automatically identify the commit that we're looking for:
running './my_test.py'\nBisecting: 3 revisions left to test after this (roughly 2 steps)\n[9be780b8785d67ee191b2c0b113270059c9e0c3a] Briefly describe key version control concepts\nrunning './my_test.py'\nBisecting: 1 revision left to test after this (roughly 1 step)\n[055906f28da146a2d012b7c1c0e4707503ed1b11] Display example commit message as plain text\nrunning './my_test.py'\nBisecting: 0 revisions left to test after this (roughly 0 steps)\n[1251357ab5b41d511deb48cd5386cae37eec6751] Rename the \"What is a repository?\" source file\nrunning './my_test.py'\n1251357ab5b41d511deb48cd5386cae37eec6751 is the first bad commit\ncommit 1251357ab5b41d511deb48cd5386cae37eec6751\nAuthor: Rob Moss <robm.dev@gmail.com>\nDate: Sun Apr 17 21:41:43 2022 +1000\n\n Rename the \"What is a repository?\" source file\n\n The file name was missing the word \"a\" and did not match the title.\n\n src/SUMMARY.md | 2 +-\n src/version-control/what-is-a-repository.md | 18 ++++++++++++++++++\n src/version-control/what-is-repository.md | 18 ------------------\n 3 files changed, 19 insertions(+), 19 deletions(-)\n create mode 100644 src/version-control/what-is-a-repository.md\n delete mode 100644 src/version-control/what-is-repository.md\n
To quit the search and return to your current commit, run the following command:
git bisect reset\n
You can then inspect this commit by running the following command:
git show 1251357\n
This section provides a high-level introduction to the concepts that you should understand in order to make effective use of version control.
Info
Version control can turn your files into a lab book that captures the broader context of your research activities and that you can easily search and reproduce.
"},{"location":"guides/version-control/exercise-using-version-control/","title":"Exercise: using version control","text":"In this section we have introduced version control, and outlined how it can be useful for academic research activities, including:
Info
We'd now like you think about how version control might be useful to you and your research.
Have you experienced any issues or challenges in your career where version control would have been helpful? For example:
Have you ever looked at some of your older code and had difficulty understanding what it is doing, how it works, or why it was written?
Have you ever had difficulties identifying what code and/or data were used to generate a particular analysis or output?
Have you ever discovered a bug in your code and tried to identify when it was introduced, or what outputs it might have affected?
When collaborating on a research project, have you ever had challenges in making sure that everyone was working with the most recent files?
How can you use version control in your current research project(s)?
Do you have an existing project or piece of code that could benefit from being stored in a repository?
Have you recently written any code that could be recorded as one or more commits?
If so, what would you write for the commit messages?
Have you written some exploratory code or analysis that could be stored in a separate branch?
Having looked at the use of version control in the past and present, how would using version control benefit you?
"},{"location":"guides/version-control/how-do-I-write-a-commit-message/","title":"How do I write a commit message?","text":"Commit messages are shown as part of the repository history (e.g., when running git log
). Each message consists of a short one-line description, followed by as much or as little text as required.
You should treat these messages as entries in a log book. Explain what changes were made and why they were made. This can help collaborators understand what we have done, but more importantly is acts as a record for our future selves.
Info
Have you ever looked at code you wrote a long time ago and wondered what you were thinking?
A history of detailed commit messages should allow you to answer this question!
Remember that code is harder to read than it is to write (Joel Spolsky).
For example, rather than writing:
Added model
You could write something like:
Implemented the initial model
This model includes all of the core features that we need to fit the data, but there several other features that we intend to add:
- Parameter X is currently constant, but we may need to allow it to vary over time;
- Parameter Y should probably be a hyperparameter; and
- The population includes age-structured mixing, but we need to also include age-specific outcomes, even though there is very little data to suggest what the age effects might be.
"},{"location":"guides/version-control/what-is-a-branch/","title":"What is a branch?","text":"A branch allows you create a series of commits that are separate from the main history of your repository. They can be used for units of work that are too large to be a single commit.
Info
It is easy to switch between branches! You can work on multiple ideas or tasks in parallel.
Consider a repository with three commits: commit A, followed by commit B, followed by commit C:
At this point, you might consider two ways to implement a new model feature. One way to do this is to create a separate branch for each implementation:
You can work on each branch, and switch between them, in the same local repository.
If you decide that the first implementation (the green branch) is the best way to proceed, you can then merge this branch back into your main branch. This means that your main branch now contains six commits (A to F), and you can continue adding new commits to your main branch:
"},{"location":"guides/version-control/what-is-a-commit/","title":"What is a commit?","text":"A \"commit\" is a set of changes to one or more files in a repository. These changes can include:
Each commit also includes the date and time that it was created, the user that created it, and a commit message.
"},{"location":"guides/version-control/what-is-a-merge-conflict/","title":"What is a merge conflict?","text":"In What is a branch? we presented an example of successfully merging a branch into another. However, when we try to merge one branch into another, we may find that the two branches have conflicting changes. This is known as a merge conflict.
Consider two branches that make conflicting changes to the same line of a file:
Replace \"Second line\" with \"My new second line\":
First line\n-Second line\n+My new second line\n Third line\n
Replace \"Second line\" with \"A different second line\":
First line\n-Second line\n+A different second line\n Third line\n
There is no way to automatically reconcile these two branches, and we have to fix this conflict manually. This means that we need to decide what the true result should be, edit the file to resolve these conflicting changes, and commit our modifications.
"},{"location":"guides/version-control/what-is-a-repository/","title":"What is a repository?","text":"A repository records a set of files managed by a version control system, including the historical record of changes made to these files.
You can create as many repositories as you want. Each repository should be a single \"thing\", such as a research project or a journal article, and should be located in a separate directory.
You will generally have at least two copies of each repository:
A local repository on your computer; and
A remote repository on a service such as GitHub, or a University-provided platform (such as the University of Melbourne's GitLab instance).
You make changes in your local repository and \"push\" them to the remote repository. You can share this remote repository with your collaborators and supervisors, and they will be able to see all of the changes that you have pushed.
You can also allow collaborators to push their own changes to the remote repository, and then \"pull\" them into your local repository. This is one way in which you can use version control to work collaboratively on a project.
"},{"location":"guides/version-control/what-is-a-tag/","title":"What is a tag?","text":"A tag is a short, unique name that identifies a specific commit. You can use tags as bookmarks for interesting or important commits. Common uses of tags include:
Identifying manuscript revisions: draft-1
, submitted-version
, revision-1
, etc.
Identifying software package versions: v1.0
, v1.1
, v2.0
, etc.
Version control is a way of systematically recording changes to files (such as computer code and data files). This allows you to restore any previous version of a file. More importantly, this history of changes can be queried, and each set of changes can include additional information, such as who made the changes and an explanation of why the changes were made.
A core component of making great decisions is understanding the rationale behind previous decisions. If we don't understand how we got \"here\", we run the risk of making things much worse.
\u2014 Chesterton's Fence
For academic research activities that involve data analysis or simulation modelling, some key uses of version control are:
You can use it as a log book, and capture a detailed and permanent record of every step of your research. This is extremely helpful for people \u2014 including you! \u2014 who want to understand and make use of your work.
You can collaborate with others in a systematic way, ensuring that everyone has access to the most recent files and data, and review everyone's contributions.
You can inspect the changes made over a period of interest (e.g., \"What have I done in the last week?\").
You can identify when a specific change occurred, and what other changes were made at the same time (e.g., \"What changes did I make that affected this output figure?\").
In this book we will focus on the Git version control system, which is used by popular online platforms such as GitHub, GitLab, and Bitbucket.
"},{"location":"guides/version-control/what-should-I-commit/","title":"What should I commit?","text":"A commit should represent a unit of work.
If you've made changes that represent multiple units of work (e.g., changing how input data are processed, and adding a new model parameter) these should be saved as separate commits.
Try describing out loud the changes you have made, and if you find yourself saying something like \"I did X and Y and Z\", then the changes should probably divided into multiple commits.
A helpful guideline is \"commit early, commit often\".
"},{"location":"guides/version-control/what-should-I-commit/#commit-early","title":"Commit early","text":"Don't delay creating a commit because \"it's not ready yet\".
A commit doesn't have to be \"perfect\".
Small, focused commits are extremely helpful when trying to identify the cause of an unintended change in your code's behaviour or output.
There is no such thing as too many commits.
For computational research, code is an important scientific artefact for the author, for colleagues and collaborators, and for the scientific community. It is the ultimate form of expressing what you did and how you did it. With good version control and documentation practices, it can also capture when and why you made important decisions.
Tip
[W]e want to establish the idea that a computer language is not just a way of getting a computer to perform operations but rather that it is a novel formal medium for expressing ideas about methodology. Thus, programs must be written for people to read, and only incidentally for machines to execute.
\u2014 Structure and Interpretation of Computer Programs. Abelson, Sussman, and Sussman, 1984.
"},{"location":"guides/writing-code/behave-nicely/","title":"Behave nicely","text":"Would you feel comfortable running someone else's code if you thought it might affect your other files, applications, settings, or do something else that's unexpected?
Tip
Your code should be encapsulated: it should assume as little as possible about the computer on which it is running, and it shouldn't mess with the user's environment.
Tip
Your code should follow the principal of least surprise: behave in a way that most users will expect it to behave, and not astonish or surprise them.
"},{"location":"guides/writing-code/behave-nicely/#a-cake-analogy","title":"A cake analogy","text":"Suppose you have two colleagues who regularly bake cakes, and you decide you'd like one of them to bake you a chocolate cake.
A nice colleague:
A messy colleague:
Avoid modifying files outside of the project directory!
Avoid using hard-coded absolute paths, such as C:\\Users\\My Name\\Some Project\\...
or /Users/My Name/Some other directory
. These make it harder for other people to use the code, or to run the code on high-performance computing platforms.
Prefer using paths that are relative to the root directory of your project, such as input-data/case-data/cases-for-2023.csv
. If you're using R, the here package is extremely helpful.
Warn the user before running tasks that take a long time to complete.
Notify the user before downloading large files.
A \"linter\" is a tool that checks your code for syntax errors, possible mistakes, inconsistent formatting, and other potential issues.
We strongly recommend using an editor that displays linter warnings as you write your code. Having instant feedback allows you to rapidly resolve many common issues and substantially improve your code.
We list here some of the most commonly used linters:
R: lintr
Python: ruff
Julia: Lint.jl
Think about how to cleanly structure your code. Take a similar approach to how we write papers and grants.
Break the overall problem into pieces, and then decide how to structure each piece in turn.
Divide your code into functions that each do one \"thing\", and group related functions into separate files or modules.
It can sometimes help to think about how you want the final code to look, and then design the functions and components that are needed.
Avoid global variables, aim to pass everything as function arguments. This makes the code more robust and easier to run.
Avoid passing lots of individual parameters as separate arguments, this is prone to error \u2014 you might not pass them in the correct order. Instead, collect the parameters into a single structure (e.g, a Python dictionary, an R named list).
Avoid making multiple copies of a model if you want to change some aspect of its behaviour. Instead, add a new model parameter that enables/disables this new behaviour. This allows you to use the same code to run the older and newer versions of the model.
Try to collect common or related tasks into a single script, and allow the user to select which task(s) to run, rather than creating many scripts that perform very similar tasks.
Write test cases to check key model properties.
You want to identify problems and mistakes as soon as possible!
Thinking about how to make your code testable can help you improve its structure!
Well-written tests can also demonstrate how to use your code!
Divide your code into modules, each of which does one thing (\"high cohesion\") and depends as little as possible on other pieces (\"low coupling\").
"},{"location":"guides/writing-code/cohesion-coupling/#common-project-components","title":"Common project components","text":"For example, an infectious diseases modelling project might often be divided into some of the following components:
The model parameters \u2014 what are their values or prior distributions?
The initial model state \u2014 how is this created from the model parameters?
The model equations or update rules \u2014 how does the model evolve over time?
Summary statistics \u2014 what do you want to record for each simulation? This might be the entire state history, a subset of the history, some aggregate statistics, or any combination of these things.
The input data (if any) \u2014 these may be case data, serological data, within-host specimen counts, etc.
The relationship between data and the model state (\"observation model\").
Simulated data generated from a model simulation.
As much as possible, each of these components (where relevant to your project) should be represented as a separate piece of code.
"},{"location":"guides/writing-code/cohesion-coupling/#separating-the-what-from-the-how","title":"Separating the \"what\" from the \"how\"","text":"Dividing your code into separate components is especially important if you want to use a model for multiple purposes, such as:
Tip
In particular, keep the following aspects of your project separate:
What to do: fitting to different data sets, exploring different scenarios, performing a sensitivity analysis, etc; and
How to do it: the model implementation.
If you want to explore a range of model scenarios, for example, define the parameter values (or sampling distributions) for each scenario in a separate input file. Then write a script that takes an input file name as an argument, reads the parameter values, and uses these values to run the model simulations.
This makes it extremely simple to define and run new scenarios without modifying your code.
"},{"location":"guides/writing-code/cohesion-coupling/#interactions-between-components","title":"Interactions between components","text":"Choosing how your components interact (e.g., by calling functions or passing data) is just as important as deciding how to divide your code into components.
Here are some key recommendations from Object-Oriented Software Construction (2nd ed):
Small interfaces: if two modules communicate, they should exchange as little information as possible.
Explicit interfaces: if two modules communicate, it should be obvious from the code in one or both of these modules.
Self documentation: strive to make all information about a module part of the module itself.
For languages such as R, Python, and Julia, it is generally a good idea to write your code as a package/library. This can make it easier to install and run your code on a new computer, on a high-performance computing platform, and for others to use on their own computers.
Info
This is a simple process and entirely separate from publishing your package or making it publicly available.
It also means you can avoid using source()
in R, or adding directories to sys.path
in Python.
To create a package you need to provide some necessary information, such as a package name, and the list of the packages that your code depends on (\"dependencies\"). You can then use packaging tools to verify that you've correctly identified these dependencies and that your package can be successfully installed and used!
This is an important step towards ensuring your work is reproducible.
There are some great online resources that can help you get started. We list here some widely-recommended resources for specific languages.
"},{"location":"guides/writing-code/create-packages/#writing-r-packages","title":"Writing R packages","text":"For R, see R Packages (2nd ed) and the devtools package.
Other useful references include:
Info
rOpenSci offers peer review of statistical software.
"},{"location":"guides/writing-code/create-packages/#writing-python-packages","title":"Writing Python packages","text":"The Python Packaging User Guide provides a tutorial on Packaging Python Projects.
Other useful references include:
The pyOpenSci project also provide a Python Packaging Guide. This includes information about code style, formatting, and linters.
This example Python project demonstrates one way of structuring a Python project as a package.
Info
pyOpenSci offers peer review of scientific software
"},{"location":"guides/writing-code/create-packages/#writing-julia-packages","title":"Writing Julia Packages","text":"The Julia's package manager documentation provides a guide to Creating Packages
"},{"location":"guides/writing-code/document-your-code/","title":"Document your code","text":"Writing clear, well-structured code, can make it easier for someone to understand what your code does. You might think that this means your code is so clear and obvious that it needs no further explanation.
But this is not true! There is always a role for writing comments and documentation. By itself, your code cannot always explain:
What goal you are trying to achieve;
How you are achieving this goal; and
Why you've chosen this approach.
Question
What can you do to make your code more easily understandable?
"},{"location":"guides/writing-code/document-your-code/#naming","title":"Naming","text":"Use good names for functions, parameters, and variables. This can be deceptively hard.
Quote
There are only two hard things in Computer Science: cache invalidation and naming things.
\u2014 Phil Karlton
"},{"location":"guides/writing-code/document-your-code/#explaining","title":"Explaining","text":"Have you explained the intention of your code?
Tip
Good comments don't say what the code does; instead, they explain why the code does what it does.
For each function, write a comment that explains what the function does, describes the purpose of each parameter, and describes what values the function returns (if any).
"},{"location":"guides/writing-code/document-your-code/#documenting","title":"Documenting","text":"Many programming languages support \"docstrings\". These are usually comments with additional structure and formatting, and can be used to automatically generate documentation:
R: roxygen2
Python: there are several formats
Julia: Writing Documentation
See the CodeRefinery In-code documentation lesson for some good examples of docstrings.
"},{"location":"guides/writing-code/document-your-code/#commenting-out-code","title":"Commenting out code","text":"Avoid commenting out code. If it's no longer useful, delete it and save this as a commit! Make sure you write a helpful commit message. You can always recover the deleted code if you need it later.
"},{"location":"guides/writing-code/exercise-seek-feedback/","title":"Exercise: seek feedback","text":"Question
One goal to keep in mind is to ensure your work is conceptually accessible: how readily could someone else (or even yourself, after a period of absence) understand your code?
Question
Have you ever looked at someone else's code and found it hard to read because they formatted it differently to your code?
Using a consistent code style can help make your code more legible and accessible to others, in much the same way that standard use of punctuation and spacing makes written text easier to read.
Tip
Good coding style is like using correct punctuation: you can manage without it, butitsuremakesthingseasiertoread.
\u2014 Hadley Wickham, the tidyverse style guide
We strongly recommend using an editor that can automatically format your code whenever you save. This allows you to completely forget about formatting and focus on the content.
We list here some of the most commonly used style guides and code formatters:
Language Style guide(s) Formatter R tidyverse styler Python PEP 8 and The Hitchhiker's Style Guide black Julia style guide Lint.jl"},{"location":"guides/writing-code/how-we-learn-to-write-code/","title":"How we learn to write code","text":"Question
How have you learned to write code? Were you given any formal training?
Unless you studied Software Engineering, you may never have had any formal training. And that's okay! Nobody writes perfect code.
There are various resources available (including this book) that can help you to improve your coding skills. But the most effective way to improve is to write code and get feedback.
Tip
You can practice shooting eight hours a day, but if your technique is wrong, then all you become is very good at shooting the wrong way.
\u2014 Michael Jordan
"},{"location":"guides/writing-code/how-we-learn-to-write-code/#how-we-learn-to-write-papers","title":"How we learn to write papers","text":"Throughout our research careers, we are continually learning and developing our ability to write scientific papers. One of the main ways that we develop this ability is to seek feedback early and often, by circulating drafts to co-authors, supervisors, and trusted colleagues.
This feedback not only helps us improve the paper that we're currently working on, but also improves our ability to write papers in the future.
We gradually learn how to express ourselves clearly at multiple levels:
Writing individual sentences that clearly convey a single thought or observation;
Constructing paragraphs that span a single topic or idea;
Structuring an entire paper so that the reader can easily navigate it.
Many of us learn to write code as a by-product of our chosen research area, and may not have any formal computer programming training. However, while we may make our finished code available as a support material for our published papers, we don't typically show our code to our co-authors.
Info
While there are many reasons why we are reluctant to share our code, perhaps the biggest factor is a sense of shame. We may feel that our code is \"bad\" \u2014 too bad to share with others! \u2014 and that if we've ever made a mistake in our code, we're the only person who has ever done so.
This is simply untrue!
"},{"location":"guides/writing-code/how-we-learn-to-write-code/#how-we-should-learn-to-code","title":"How we should learn to code","text":"We should treat writing code the same way that we treat writing papers, grant applications, and fellowship applications: seek feedback early, and seek feedback often.
Question
Wouldn't you prefer that the first person who looks at your code is a trusted colleague, rather than a random person who has read your paper and now wants to see how the code works?
Peer code review offers a structured way to:
Discuss and critique a person's work in a kind and supportive manner;
Praise good work;
Identify where code is well-structured and clear, and where it could be improved; and
Share relevant knowledge and expertise.
Similar to writing papers, we should seek feedback at multiple levels:
Are individual lines of code clear and correct?
Are strongly-related lines of code grouped into functions that each do a single thing?
Are functions grouped into modules that focus on specific aspects or features?
Can the reader easily navigate the code?
One of our goals for 2024 is to develop orientation materials for new students, postdocs, etc. There was broad interest in having a checklist, and example workflows for people to follow \u2014 particularly for projects that involve some form of code \"hand over\", to ensure that the recipients experience few problems in running the code themselves.
How to contribute
To suggest a new topic:
Use the search box (top of the page) and check if the topic already exists;
If the topic does not exist, submit a \"New Topic\" issue.
To suggest a useful resource: submit a \"Useful Resource\" issue.
To provide feedback about existing content: submit a \"Feedback\" issue.
To contribute new content:
Use the search box (top of the page) and check if similar content already exists;
If there is no similar content, please create a fork of this repository, add your contributions, and create a pull request.
See our How to contribute page for more details, such as formatting guidelines.
Current issues for the orientation guide are listed here.
Suggested topics included:
Note
In addition to the topical guides, the Useful resources section includes:
These materials aim to support early- and mid-career researchers (EMCRs) in the SPECTRUM and SPARK networks to develop their computing skills, and to make effective use of available tools1 and infrastructure2.
"},{"location":"#structure","title":"Structure","text":"Start with the basics: Our orientation tutorials provide overviews of essential skills, tools, templates, and suggested workflows.
Learn more about best practices: Our topical guides explain a range of topics and provide exercises to test your understanding.
Come together as a community: Our Community of Practice is how we come together to share skills, knowledge, and experience.
"},{"location":"#motivation","title":"Motivation","text":"Question
Why dedicate time and effort to learning these skills? There are many reasons!
The overall aim of these materials is help you conduct code-driven research more efficiently and with greater confidence.
Hopefully some of the following reasons resonate with you.
Fearlessly modify your code, knowing that your past work is never lost, by using version control with git.
Verify that your code behaves as expected, and get notified when it doesn't, by writing tests.
Ensure that your results won't change when running on a different computer by \"baking in\" reproducibility.
Improve your coding skills, and those of your colleagues, by working collaboratively and making use of peer code review.
Run your code quickly, and without relying on your own laptop or computer, by using high-performance computing.
Foundations of effective research
A piece of code is often useful beyond a single project or study.
By applying the above skills in your research, you will be able to easily reproduce past results, extend your code to address new questions and problems, and allow others to build on your code in their own research.
The benefits of good practices can continue to pay off long after the project is finished.
"},{"location":"#license","title":"License","text":"This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Such as version control and testing frameworks.\u00a0\u21a9
Such as the ARDC Nectar Research Cloud and Spartan.\u00a0\u21a9
Here is a list of the contributors who have helped develop these materials:
If you've made use of Git in your research activities, please let us know! We're looking for case studies that highlight how EMCRs are using Git. See the instructions for suggesting new content (below).
"},{"location":"how-to-contribute/#provide-comments-and-feedback","title":"Provide comments and feedback","text":"The easiest way to provide comments and feedback is to create an issue. Note that this requires a GitHub account. If you do not have a GitHub account, you can email any of the authors. Please include \"Git is my lab book\" in the subject line.
"},{"location":"how-to-contribute/#suggest-modifications-and-new-content","title":"Suggest modifications and new content","text":"This book is written in Markdown and is published using Material for MkDocs. See the Material for MkDocs Reference for an overview of the supported features.
You can suggest modifications and new content by:
Forking the book repository;
Adding, deleting, and/or modifying book chapters in the docs/ directory;
Recording your changes in one or more git commits; and
Creating a pull request, so that we can review your suggestions.
Info
You can also edit any page by clicking the \"Edit this page\" button in the top-right corner. This will start the process described above by forking the book repository.
Tip
When editing Markdown content, please start each sentence on a separate line. Also check that your text editor removes trailing whitespace.
This ensures that each commit will contain only the modified sentences, and makes it easier to inspect the repository history.
Tip
When you add a new page, you must also add the page to the nav block in mkdocs.yml.
You can display content in multiple tabs by using ===. For example:
=== \"Python\"\n\n ```py\n print(\"Hello world\")\n ```\n\n=== \"R\"\n\n ```R\n cat(\"Hello world\\n\")\n ```\n\n=== \"C++\"\n\n ```cpp\n #include <iostream>\n\n int main() {\n std::cout << \"Hello World\";\n return 0;\n }\n ```\n\n=== \"Shell\"\n\n ```sh\n echo \"Hello world\"\n ```\n\n=== \"Rust\"\n\n ```rust\n fn main() {\n println!(\"Hello World\");\n }\n ```\n
produces:
print(\"Hello world\")\n
cat(\"Hello world\\n\")\n
#include <iostream>\n\nint main() {\n std::cout << \"Hello World\";\n return 0;\n}\n
echo \"Hello world\"\n
fn main() {\n println!(\"Hello World\");\n}\n
"},{"location":"how-to-contribute/#adding-terminal-session-recordings","title":"Adding terminal session recordings","text":"You can use asciinema to record a terminal session, and display this recorded session with a small amount of HTML and JavaScript. For example, the following code is used to display the where-did-this-line-come-from.cast
recording in a tab called \"Video demonstration\", as shown in Where did this line come from? chapter:
=== \"Video demonstration\"\n\n <div id=\"demo\" data-cast-file=\"../where-did-this-line-come-from.cast\"></div>\n
You can also add links that jump to specific times in the video. Each link must have:
A data-video attribute that identifies the video (in the example above, this is \"demo\");
A data-seek-to attribute that identifies the time (in seconds) to jump to; and
An href attribute that is set to \"javascript:;\" (so that the link doesn't scroll the page).
For example, the following code is used to display the video recording on the Choosing your Git Editor page:
=== \"Git editor example\"\n\n <div id=\"demo\" data-cast-file=\"../git-editor-example.cast\"></div>\n\n Video timeline:\n\n 1. <a data-video=\"demo\" data-seek-to=\"4\" href=\"javascript:;\">Overview</a>\n 2. <a data-video=\"demo\" data-seek-to=\"17\" href=\"javascript:;\">Show how to use nano</a>\n 3. <a data-video=\"demo\" data-seek-to=\"71\" href=\"javascript:;\">Show how to use vim</a>\n
You can use the asciinema-scripted tool to generate scripted recordings.
"},{"location":"community/","title":"Community of Practice","text":"Info
Communities of Practice are groups of people who share a concern or a passion for something they do and learn how to do it better as they interact regularly.
The community acts as a living curriculum and involves learning on the part of everyone.
The aim of a Community of Practice (CoP) is to come together as a community and engage in a process of collective learning in a shared domain. The three characteristics of a CoP are:
Community: An environment for learning through interaction;
Practice: Specific knowledge shared by community members; and
Domain: A shared interest, problem, or concern.
We regularly meet as a community, report meeting summaries, and collect case studies that showcase good practices.
"},{"location":"community/#training-events","title":"Training events","text":"To support skill development, we have the capacity to prepare and deliver bespoke training events as standalone session and as part of larger meetings and conferences. See our Training events page for further details.
"},{"location":"community/case-studies/","title":"Case studies","text":"This section contains interesting and useful examples of incorporating Git into a research activity, as contributed by EMCRs in our network.
"},{"location":"community/case-studies/campbell-pen-and-paper-version-control/","title":"Pen and paper - a less user-friendly form of version control than Git","text":"Author: Trish Campbell (patricia.campbell@unimelb.edu.au
)
Project: Pertussis modelling
"},{"location":"community/case-studies/campbell-pen-and-paper-version-control/#the-problem","title":"The problem","text":"In this project, I developed a compartmental model of pertussis to determine appropriate vaccination strategies. While plotting some single model simulations, I noticed anomalies in the modelled output for two experiments. The first experiment had an order of magnitude more people in the infectious compartments than in the second experiment, even though there seemed to be far fewer infections occurring. This scenario did not fit with the parameter values that were being used. In the differential equation file for my model, in addition to extracting the state of the model (i.e., the population in each compartment at each time step), for ease of analysis I also extracted the cumulative number of infections up to that time step. The calculation for this extraction of cumulative incidence was incorrect.
"},{"location":"community/case-studies/campbell-pen-and-paper-version-control/#the-solution","title":"The solution","text":"The error occurred because susceptible people in my model were not all equally susceptible, and I failed to account for this when I calculated the cumulative number of infections at each time step. I identified that this was the problem by running some targeted test parameter sets and observing the changes in model output. The next step was to find out how long this bug had existed in the code and which analyses had been affected. While I was using version control, I tended to make large infrequent commits. I did, however, keep extensive hand-written notes in lab books, which played the role of a detailed history of commits. Searching through my historical lab books, I identified that I had introduced this bug into the code two years earlier. I was able to determine which parts of my results would have been affected by the bug and made the decision that all experiments needed to be re-run.
"},{"location":"community/case-studies/campbell-pen-and-paper-version-control/#how-version-control-helped","title":"How version control helped","text":"Using a pen and paper form of version control enabled me to pinpoint the introduction of the error and identify the affected analyses, but it was a tedious process. While keeping an immaculate record of changes that I had made was invaluable, imagine how much simpler and faster the process would have been if I had been a regular user of an electronic version control system such as Git!
"},{"location":"community/case-studies/moss-incorrect-data-pre-print/","title":"Incorrect data in a pre-print figure","text":"Author: Rob Moss (rgmoss@unimelb.edu.au
)
Project: COVID-19 scenario modelling (public repository)
"},{"location":"community/case-studies/moss-incorrect-data-pre-print/#the-problem","title":"The problem","text":"Our colleague James Trauer notified us that they suspected there was an error in Figure 2 of our COVID-19 scenario modelling pre-print article. This figure showed model predictions of the daily ICU admission demand in an unmitigated COVID-19 pandemic, and in a COVID-19 pandemic with case targeted public health measures. I inspected the script responsible for plotting this figure, and confirmed that I had mistakenly plotted the combined demand for ward and ICU beds, instead of the demand for ICU beds alone.
"},{"location":"community/case-studies/moss-incorrect-data-pre-print/#the-solution","title":"The solution","text":"This mistake was simple to correct, but the obvious concern was whether any other outputs related to ICU bed demand were affected.
We conducted a detailed review of all data analysis scripts and outputs, and confirmed that this error only affected this single manuscript figure. It had no bearing on the impact of the interventions in each model scenario. Importantly, it did not affect any of the simulation outputs, summary tables, and/or figures that were included in our reports to government.
The corrected figure can be seen in the published article.
"},{"location":"community/case-studies/moss-incorrect-data-pre-print/#how-version-control-helped","title":"How version control helped","text":"Because we used version control to record the development history of the model and all of the simulation analyses, we were able to easily inspect the repository state at the time of each prior analysis. This greatly simplified the review process, and ensured that we were inspecting the code exactly as it was when we produced each analysis.
"},{"location":"community/case-studies/moss-pypfilt-earlier-states/","title":"Fixing a bug in pypfilt","text":"Author: Rob Moss (rgmoss@unimelb.edu.au
)
Project: pypfilt, a bootstrap particle filter for Python
Date: 27 October 2021
"},{"location":"community/case-studies/moss-pypfilt-earlier-states/#overview","title":"Overview","text":"I introduced a bug when I modified a function in my pypfilt package, and only detected the bug after I had created several more commits.
To resolve this bug, I had to:
Notice the bug;
Identify the cause of the bug;
Write a test case to check whether the bug is present; and
Fix the bug.
I noticed that a regression test1 was failing: re-running a set of model simulations was no longer generating the same output. The results had changed, but none of my recent commits should have had this effect.
I should have noticed this when I created the commit that introduced this bug, but:
I had not pushed the most recent commits to the upstream repository, where all of the test cases are run automatically every time a new commit is pushed; and
I had not run the test cases on my laptop after making each of the recent commits, because this takes a few minutes and I was lazy.
I knew that the bug had been introduced quite recently, and I knew that it affected a specific function: earlier_states()
. Running git blame src/pypfilt/state.py
indicated that the recent commit 408b5f1
was a likely culprit, because it changed many lines in this function.
In particular, I suspected the bug was occurring in the following loop, which steps backwards in time and handles the case where model simulations are reordered:
# Start with the parent indices for the current particles, which allow us\n# to look back one time-step.\nparent_ixs = np.copy(hist['prev_ix'][ix])\n\n# Continue looking back one time-step, and only update the parent indices\n# at time-step T if the particles were resampled at time-step T+1.\nfor i in range(1, steps):\n step_ix = ix - i\n if hist['resampled'][step_ix + 1, 0]:\n parent_ixs = hist['prev_ix'][step_ix, parent_ixs]\n
In stepping through this code, I identified that the following line was incorrect:
if hist['resampled'][step_ix + 1, 0]:\n
and that changing step_ix + 1
to step_ix
should fix the bug.
Note: I could have used git bisect
to identify the commit that introduced this bug, but running all of the test cases for each commit is relatively time-consuming; since I knew that the bug had been introduced quite recently, I chose to use git blame
.
I wrote a test case test_earlier_state()
that called this earlier_states()
function a number of times, and checked that each set of model simulations were returned in the correct order.
This test case checks that:
If the model simulations were not reordered, the original ordering is always returned;
If the model simulations were reordered at some time t_0
, the original ordering is returned for times t < t_0
; and
If the model simulations were reordered at some time t_0
, the new ordering is returned for times t >= t_0
.
This test case failed when I reran the testing pipeline, which indicated that it identified the bug.
"},{"location":"community/case-studies/moss-pypfilt-earlier-states/#fix-the-bug","title":"Fix the bug","text":"With the test case now written, I was able to verify that that changing step_ix + 1
to step_ix
did fix the bug.
I added the test case and the bug fix in commit 9dcf621
.
In the commit message I indicated:
Where the bug was located: the earlier_states()
function;
When the bug was introduced: commit 408b5f1
; and
Why the bug was not detected when I created commit 408b5f1
.
A regression test checks that a commit hasn't changed an existing behaviour or functionality.\u00a0\u21a9
This section contains summaries of each Community of Practice meeting.
8 August 2024: orientation guide planning.
11 July 2024: presentation from Nefel Tellioglu.
13 June 2024: presentation from Cam Zachreson.
9 May 2024: presentation from TK Le.
11 April 2024: ideas and resources for the orientation guide.
19 February 2024: identify goals and activities for 2024.
18 October 2023: sharing experiences about good ways to structure a project.
15 August 2023: changing our research and reproducibility practices.
13 June 2023: exploration of version control, reproducibility, and testing exercises.
17 April 2023: our initial meeting.
This is our initial meeting. The goal is to welcome people to the community and outline how we envision running these Community of Practice meetings.
"},{"location":"community/meetings/2023-04-17/#theme-reproducible-research","title":"Theme: Reproducible Research","text":"Outline the theme and scope for this community.
This is open to all researchers who share an interest in reproducible research and/or related topics and practices; no prior knowledge is required.
For example, consider these questions:
Can you reproduce your current results on a new computer?
Can someone else reproduce your current results?
Can someone else reproduce your current results without your help?
Can you reproduce your own results from, say, 2 years ago?
Can someone else reproduce your own results from, say, 2 years ago?
Can you fix a mistake and update your own results from, say, 2 years ago?
Tip
The biggest challenge can often be remembering what you did and how you did it.
Making small changes to your practices can greatly improve reproducibilty!
"},{"location":"community/meetings/2023-04-17/#how-will-these-meetings-run","title":"How will these meetings run?","text":"Aim to hold these meetings on a (roughly) monthly basis.
Prior to each meeting, we will invite community members to propose a topic or discussion point to be the focus of the meeting. This may be an open question or challenge, an example of good research practices, a useful software tool, etc.
Schedule each meeting to best suit the availability of community members who are particularly interested in the chosen topic.
Each meeting should be hosted by one or more community members, with online participation available to those who cannot attend in person.
At the end of each meeting, we will ask attendees how useful/effective they found the meeting, so that we can better cater to the needs of the community. For example:
What do you think of the session?
What could we do better in the next session?
We will summarise the key observations, findings, and outputs of each meeting in our online materials, and use them to improve and grow our training materials.
Info
To function effectively as a community, we need to support asynchronous discussions in addition to scheduled meetings.
One option is a dedicated mailing list. Other options were suggested:
A Slack workspace (Dave);
A Discord channel (TK);
A Teams channel (Gerry); and
A private GitHub repository, using the issue tracker (Alex).
Using a GitHub issue tracker might also serve as a gentle introduction to GitHub?
"},{"location":"community/meetings/2023-04-17/#supporting-activities-and-resources","title":"Supporting activities and resources?","text":"Are there other activities that we could organise to help support the community?
We have online training materials, Git is my lab book, which should be useful for people who are not familiar with version control.
We also have a SPECTRUM/SPARK peer review team, where people can make their code available for peer review.
We asked each participant to suggest topics that they would like to see covered in future meetings and/or activities. A number of common themes emerged.
"},{"location":"community/meetings/2023-04-17/#version-control-from-theory-to-practice","title":"Version control: from theory to practice","text":"A number of people mentioned now being sure how to get started, or starting with good intentions but ending up with a mess.
Dave: how can I transition from principle to practice?
Ollie: similar to David, I often start well but end up with a mess.
Ruarai: what other have found useful and applied in this space, what options are out there?
Michael: I'm a complete novice, git command lines are a foreign language to me! I'm looking for tips for someone who uses code a lot, experienced at coding but much less so on version control and the use of repositories. What are the first steps to incorporate it into my workflow?
Angus: I'm also relatively new to Git and have been using GitHub Desktop (a GUI for Windows and Mac). I'm not averse to command line stuff but I need to remember fewer arcane commands!
Samik: I use TortoiseGit \u2014 a Windows Shell Interface to Git.
Gray: I resonate with Michael, I do most of my research on my own and describe it in papers. It isn't particularly Git-friendly, I'm keen to learn.
Lauren: everything that everyone has said so far! I've found some good guidelines for how to write reproducible code, starting from the basics all the way to niche topics. Can we use this as a way to share materials that we've sound useful? The British Ecological Society have published guidelines. We could assemble good materials that start from basics.
David: The Society for Open, Reliable, and Transparent Ecology and Evolutionary Biology (SORTEE) also have good materials.
Gerry: I like the idea of reproducibility and I've done a terrible job of it in the past, my repository ends up with thousands of versions of different files. Can you help me solve it?
Josh: Along the same lines of what's been stated. How best to share knowledge of Git and best practices with others in a new research team? How to adjust to their methods of conducting reproducible research, version control, etc?
Punya: not much to add, would really like to know more about version control, I have a basic understanding, what's the standard way of using it, reproducibility and documentation.
Rachel: I strongly support this idea of code reproducibility. Best practice frameworks can be disseminated to modellers in modelling consortia, and they can be very helpful when auditing.
Ella: we're migrating models from Excel to R.
J'Belle: I work for a tiny, very remote health service at the Australian and Papua New Guinea border. We have 17 sources of clinical data, which presents massive challenges in reproducibility and quality assurance. I'm looking to tap into your expertise. How do we manage so many sources of clinical data?
How can we make best use of existing tools and practices, while working with collaborators who have less technical expertise/experience?
Alex: if I start a project with collaborators who may be less technically literate, how can they contribute without messing up reproducibility? Options like Docker are a little too complicated. How can I motivate people, is there a simple solution?
Angus: in theory you may have reproducible code. But if you need to write a paper with less technical collaborators, running the code and generating reports can be hard. How do we collaborate on the writing side? RMarkdown and equivalents makes a lot of sense, but most colleagues will only look at Word documents. There are some workarounds, such as pandoc.
How far can/should we go in validating and demonstrating that our models and analyses are reproducible? How can we automate this? How is this impacted when we cannot share the input data, or when our models are extremely large and complex?
Cam: there are unique issues in the type of research we do. Working with code makes it easy in some ways, as opposed to experimental conditions in real-world experiments. Our capacity for reproducibility is great, but so then is our burden. We should be exploring the limitations! Some challenges in our area come down to implementation of stochastic models with lots of random processes. How can we do that well and make it part of what we practice? What are the limitations to reproducibility and how do we perceive the goals when we are working when the data cannot be shared?
Samik: similar to Cam, I'm interested in how people have produced reproducible research where the data cannot be shared. Perhaps we can provide default examples as test cases?
Michael: I second Cam's points, particularly about reproducibility with confidential data. That's an issue I've hit multiple times. We usually have a side folder with the real dataset, and pipe through condensed or aggregated versions of the data that we can release.
Jiahao: I'm interested in how to build a platform for using agent based models. I've looked at lots of other models, but how can we bring them together so that it is easier to add a new variable or extend a model?
Eamon: I'm a Git fanatic, and I want to understand the development of code that I work with. I get annoyed when people share a repository as a single commit. People who don't use tags in their Git repositories to identify the version of the code they used in, e.g., a paper! How do you start running the code? What file formats does it expect to process?
Dion: I'm interested in seeing what people are doing that look like good practice. Making sure that code and results are reproducible, in the sense that your code may be version controlled, but you've since made changes to code, parameters, input data, etc. How do you do a good job to shoe-horn that all into Git? Maybe use Git for development and simultaneously use a separate repository for production stuff? We need to be able to look back and identify from the commit the code, data, intermediate files used along the way.
Palang: I've looked at papers with supplementary code, but running the code myself produces very different results from what's in the paper.
May: most people have said what I wanted to say. I faced similar problems with running other people's code. It may not print any error message, but you get very different results from what's published in the paper. You don't know who or what is wrong!
How can we develop confidence in our own code, and in other people's code?
TK: I want to learn/see different conventions for writing code documentation. I've never managed to get doxygen working to my satisfaction.
Angus: how do we design good tests? How to test, when to test, what to test for? Should we have coverage targets? Are there ways to automated testing?
Rahmat: I often find it very hard to learn how to use other people's code. The code needs to be easy to understand. Otherwise, I will just write the code myself! Sometimes when I run the code, I have difficulties in generating results, many errors come up and it's not clear why. Perhaps all of the necessary data have not been shared with the code? We need to include the data, but if the data cannot be provided, you need to provide similar data so that other can run the code. It also helps to use a language that others are familiar with.
Pan: I am not sure about the term reproducibility in the context of coders. I know lab people really do reuse published protocols. But do coders actually reuse other people's code to do their work?
Gerry: People often make their code into packages which others reuse. This could be a good topic for future meetings.
Pan: I recently joined a meeting where people have used Chat GPT to check their code. Does this group have any thoughts on how we might make good use of Chat GPT?
Cam: Chat GPT is not reproducible itself, so it seems questionable to use it to check reproducibility.
Alex: I don't entirely agree, it can be very useful for improving the implementation of a function. In terms of generating reliable code, it's wonderful. It's a nightmare for evaluating existing code.
Pan: people are using Chat GPT to generate initial templates.
Eamon: If you encounter code that has poor documentation, Chat GPT is surprisingly good at telling you how to use it.
Matt: I don't have anything to add to the above, I'm happy to be along for the ride.
In this meeting we asked participants to share their experiences exploring the version control, reproducibility, and testing exercises in our example repository.
This repository serves an introduction to testing models and ensuring that their outputs are reproducible. It contains a simple stochastic model that draws samples from a normal distribution, and some tests that check whether the model outputs are consistent with our expectations.
"},{"location":"community/meetings/2023-06-13/#what-is-a-reproducible-environment","title":"What is a reproducible environment?","text":"The exercise description was deliberately very open, but it may have been too vague:
Define a reproducible environment in which the model can run.
We avoided listing possible details for people to consider, such as software and package versions. Perhaps a better approach would have been to ask:
If this code was provided as supporting materials for a paper, what other information would you need in order to run it and be confident of obtaining the same results as the original authors?
The purpose of a reproducible environment is to define all of these details, so that you never have to say to someone \"well, it runs fine on my machine\".
"},{"location":"community/meetings/2023-06-13/#reproducibility-and-stochasticity","title":"Reproducibility and stochasticity","text":"Many participants observed that the model was not reproducible unless we used a random number generator (RNG) with a known seed, which would ensure that the model produces the same output each time we run it.
But what if you're using a package or library that internally uses their own RNG and/or seed? This may not be something you can fix, but you should be able to detect it by running the model multiple times with the same seed, and checking whether you get identical result each time.
Another important question was raised: do you, or should you, include the RNG seed in your published code? This is probably a good idea, and suggested solutions included setting the seed at the very start of your code (so that it's immediately visible) or including it as a required model parameter.
"},{"location":"community/meetings/2023-06-13/#writing-test-cases","title":"Writing test cases","text":"Tip
Write a test case every time you find a bug: ensure that the test case finds the bug, then fix the bug, then ensure that the test case passes.
A test case is a piece of code that checks that something behaves as expected. This can be as simple as checking that a mathematical function returns an expected value, to running many model simulations and verifying that a summary statistic falls within an expected range.
Rather than trying to write a single test that checks many different properties of a piece of code, it can be much simpler and quicker to write many separate tests that each check a single property. This can provide more detailed feedback when one or more test cases fail.
Note
This approach is similar to how we rely on multiple public health interventions to protect against disease outbreaks! Consider each test case as a slice of Swiss cheese \u2014 many imperfect tests can provide a high degree of confidence in our code.
"},{"location":"community/meetings/2023-06-13/#writing-test-cases-for-conditions-that-may-fail","title":"Writing test cases for conditions that may fail","text":"If you are testing a stochastic model, you may find certain test cases are difficult to write.
For example, consider a stochastic SIR model where you want to test that an intervention reduces the number of cases in an outbreak. You may, however, observe that in a small proportion of simulations the intervention has no effect (or it may even increase the number of cases).
One approach is to run many pairs of simulations and only check that the intervention reduced the number of cases at least X% of the time. You need to decide how many simulations to run, and what is an appropriate value for X%, but that's okay! Remember the Swiss cheese analogy, mentioned above.
"},{"location":"community/meetings/2023-06-13/#testing-frameworks","title":"Testing frameworks","text":"If you have more than 2 or 3 test cases, it's a good idea to use a testing framework to automatically find your test cases, run each test, record whether it passed or failed, and report the results. These frameworks are usually specific to a single programming language.
Some commonly-used frameworks include:
Multiple participants reported some difficulties in setting up GitHub actions and knowing how to adapt available templates to their needs. See the following examples:
We will aim to provide a GitHub action workflow for each model, and add comments to explain how to adapt these templates.
Warning
One downside of using GitHub Actions is the limited computation time of 2,000 minutes per month. This may not be suitable for large agent-based models and other long-running tasks.
"},{"location":"community/meetings/2023-06-13/#pull-requests","title":"Pull requests","text":"At the time of writing, three participants have contributed pull requests:
TK added a default seed so that the model outputs are reproducible.
Micheal added a MATLAB version of the model and the test cases.
Cam added several features, such as recording metadata about the Python environment and testing that the model outputs are reproducible.
Tip
If you make your own copy (\"fork\") of the example repository, you can create as many commits as you want. GitHub will display a message that says:
This branch is N commits ahead of rrcop:master.
Click on the \"N commits ahead\" link to see a summary of your new commits. You can then click the big green button \"Create pull request\".
This will not modify the example repository. Instead, it will create an overview of the changes between your code and the example repository. We can then review these changes, make suggestions, you can add new commits, etc, before deciding whether to add these changes to the example repository.
"},{"location":"community/meetings/2023-08-15/","title":"15 August 2023","text":"Info
See the Resources section for links to useful resources that were mentioned in this meeting.
"},{"location":"community/meetings/2023-08-15/#changes-to-practices","title":"Changes to practices","text":"In this meeting we asked everyone what changes (if any) they have made to their research and reproducibility practices since our last meeting.
A common theme was improving how we note and record our past actions. For example:
Eamon has begun recording the commit ID (\"hash\") of the code that was used to generate each set of outputs. This allows him to easily retrieve the exact version of the code that was used to generate any past result and, e.g., generate other outputs of interest.
Pan talked about how their group records raw separately from, but grouped with, the analysis code and processed data that were generated from these raw data. They also record every step of their model-fitting process, which may not always go as smoothly as expected.
This ensures that stakeholders who want to use these models to run their own scenarios can reproduce the baseline scenarios without being modelling experts themselves.
The model is available as an online app.
Gizem asked the group \"How do you choose an appropriate project structure, especially if the project changes over time?\"
Phrutsamon: the TIER Protocol 4.0 provides a template for organising the contents and reproduction documentation for projects that involve working with statistical data.
Rob: there may not be a single perfect solution that addresses everyone's needs. But look back at past projects, and try to imagine how the current project might change in the future. And if you're using version control, don't be afraid to experiment with different project structures \u2014 you can always revert back to an earlier commit.
"},{"location":"community/meetings/2023-08-15/#reviewing-code-as-part-of-manuscript-peer-review","title":"Reviewing code as part of (manuscript) peer review","text":"Rob asked the group \"Has anyone reviewed supporting code when reviewing a manuscript?\"
Ruarai read through R code that was provided with a paper, but was unable to run all of it \u2014 some of the code produced errors.
Similarly, Rob has read R code provided with a paper that used hard-coded paths that did not exist (e.g., \"C:\\Users\\<Author Name>\\...\"
), tried to run code in source files that did not exist, and read data from CSV files that did not exist.
Info
Pan mentioned a fantastic exercise for research students.
Pick a modelling paper that is relevant to their research project, and ask the student to:
This teaches the students that reproducibility is very important, and shows them what they need to do when they publish their own results.
It's important to pick a relatively simple paper, so that this task isn't too complicated for the student. And if the paper is written by a colleague or collaborator, you can contact them to ask for extra details, etc.
"},{"location":"community/meetings/2023-08-15/#using-shiny-to-make-models-availablereproducible","title":"Using Shiny to make models available/reproducible","text":"Pan asked the group \"What do you think about (the extra work involved in) turning R code into Shiny applications, to show that the model is reproducible, and do so in a way that lets others easily make use it?\"
An objective of the COVID-19 International Modelling Consortium (CoMo) is to make models available and usable for non-modellers \u2014 turning models into something that anyone with minimal knowledge can explore.
The model is available as a Shiny app, and is continually being updated and refined. It is currently at version 19! Pan's group is trying to ensure that existing users update to the most recent version, because it can be very challenging and time-consuming to create scenario templates for older model versions. Templates are a good way to help the user define their scenario-specific settings, but it's a nightmare when you change the model version \u2014 it's like working with a new model.
Eamon: this is similar to when software platforms make changes to their APIs. Can you make backwards-compatible changes, or automatically transform old templates to make them compatible with the latest model version? This kind of work is simple to fund when your software is a commercial product, but it's much harder to find funding for academic outputs.
Pan: It's a lot of extra work, without any money to support it. For this consortium we hired several programmers, some for the coding, some specifically for the Shiny app, it involved a lot of resources. That project has now ended, but we've learned a lot and have a good network of collaborators. We still have monthly meetings! This was a special case with COVID-19, because the context changed so quickly. It would be much less of a problem with other diseases, which we better understood.
Gizem: very much in favour of using Shiny to make models available, and recently made a Shiny app for their latest project (currently under review). Because the model is very complicated, we had to pre-calculate model results for specific parameter combinations, and only allow users to choose between these parameter combinations. One reviewer asked for a modified figure to show results for slightly different parameter values, and it was quite simple to address.
Hadley Wickham has written a very good book about developing R Shiny applications. Gizem read a chapter of this book each morning, but found it necessary to practice in order to really understand how to use Shiny.
Info
Learning by doing (experiential learning) is a highly-effective way of convincing people to change their practices. It can be greatly enhanced by engaging as a community.
"},{"location":"community/meetings/2023-08-15/#resources","title":"Resources","text":""},{"location":"community/meetings/2023-08-15/#teaching-reproducibility-and-responsible-workflows","title":"Teaching reproducibility and responsible workflows","text":"The Journal of Statistics and Data Science Education published a special issue: Teaching Reproducibility in November 2022. The accompanying editorial article highlights:
Integrating reproducibility into our practice and our teaching can seem intimidating initially. One way forward is to start small. Make one small change to add an element of exposing students to reproducibility in one class, then make another the next semester. Our students can get much of the benefit of reproducible and responsible workflows even if we just make a few small changes in our teaching. These efforts will help them to make more trustworthy insights from data. If it leads, by way of some virtuous cycle, to us improving our own practice, then even better! Improving our teaching through providing curricular guidance about reproducible science will take time and effort that should pay off in the long term.
This journal issue was followed by an invited paper session with the following presentations:
Collaborative writing workflows: building blocks towards reproducibility
Opinionated practices for teaching reproducibility: motivation, guided instruction, and practice
From teaching to practice: Insights from the Toronto Reproducibility Conferences
Teaching reproducibility and responsible workflow: an editor's perspective
Documentation that meets the specifications of the TIER Protocol contains all the data, scripts, and supporting information necessary to enable you, your instructor, or an interested third party to reproduce all the computations necessary to generate the results you present in the report you write about your project.
"},{"location":"community/meetings/2023-08-15/#using-shiny","title":"Using Shiny","text":"Mastering Shiny: an online book that teaches how to create web applications with R and Shiny.
CoMo Consortium App: the COVID-19 International Modelling Consortium (CoMo) has developed a web application for an age-structured, compartmental SEIRS model.
Building reproducible analytical pipelines with R: this article shows how to use GitHub Actions to run R code when you push new commits to a GitHub repository.
GitHub Actions for the R language: this repository provides a variety of GitHub actions for R projects, such as installing specific versions of R and R packages.
See the GitHub actions for Git is my lab book, available here. For example, the build action performs the following actions:
Check out the repository, using actions/checkout
;
Install mdBook and other required tools, using make.
Build a HTML version of the book, using mdBook
.
In this meeting we asked participants to share their experiences about good (and bad) ways to structure a project.
Info
We are currently drafting Project structure and Writing code guidelines.
See the pull request for further details. Please contribute suggestions!
We had six in-person and eight online attendees. Everyone predominantly uses one or more of the following languages:
The tidyverse style guide includes recommendations for naming files. One interesting recommendation in this guide is:
If files should be run in a particular order, prefix each file name with a number. For example:
00_download.R\n01_clean.R\n02_summarise.R\n...\n09_plot_figures.R\n10_generate_tables.R\n
A common starting point is often one or more scripts in the root directory. But we can usually divide a project into several distinct steps or stages, and store the files necessary for each stage in a separate sub-directory.
Tip
Your project structure may change as the project develops. That's okay!
You might, e.g., realise that some files should be moved to a new, or different, sub-directory.
Packaging: Python and R allow you to bundle multiple code files into a \"package\". This makes it easier to use code that is split up into multiple files. It also makes it simpler to test and verify whether your code can be run on a different computer. To create a package, you need to provide some metadata, including a list of dependencies (packages or libraries that your code needs in order to run). When installing a Python or R package, it will automatically install the necessary dependencies too. You test this out on, e.g., a virtual machine to verify that you've correctly listed all of the necessary dependencies.
Version control: the history may be extremely useful for you, but may contain things you don't want to make publicly available. One solution would be to know from the very start what files you will want to make available and what files you do not (e.g., sensitive data files), but this is not always possible. Another, more realistic, solution is to create a new repository, copy over all of the files that you want to make available, and record these files in a single commit. The public repository will not share the history of your project repository, and that's okay \u2014 the public repository's primary purpose is to serve as a snapshot, rather than a complete and unedited history.
"},{"location":"community/meetings/2023-10-18/#locating-files","title":"Locating files","text":"A common concern how to locate files in different sub-directories (e.g., loading code, reading data files, writing output files) without relying on using absolute paths. For loading code, Python and Matlab allow the user to add directories to the search path (e.g., by modifying sys.path
in Python, or calling addpath()
in Matlab). But these are not ideal solutions.
As a general rule, prefer using relative paths instead of absolute paths.
Relative paths are defined relative to the current working directory. For example: sub-directory/file-name
and ../other-directory
.
Absolute paths are defined relative to the root drive or directory. For example: /Users/My Name/...
and C:\\Users\\My Name\\...
.
Absolute paths may not exist on other people's computers.
project/input-data/file-1.csv
and a script file in project/analysis-code/read-input-data.R
, you can locate the data file from within the script with the following code:library(here)\ndata_file <- here(\"input-data/file-1.csv\")\n
Tip
A general solution for any programming language is to break your code into functions, each of which accepts input and/or output file names as arguments (when required). This means that most of your code is entirely independent of your chosen project structure. You can then store/generate all of the file paths in a single file, or in each of your top-level scripts.
"},{"location":"community/meetings/2023-10-18/#peer-review-get-feedback-on-project-structure","title":"Peer review: get feedback on project structure","text":"It can be helpful to get feedback from someone who isn't directly involved in the project. They may view the work from a fresh perspective, and be able to identify aspects that are confusing or unclear.
When inviting someone to review your work, you should identify specific questions or tasks that you would like the reviewer to address.
With respect to project structure, you may want to ask the reviewer to address questions such as:
README.md
file help you to navigate the project?You could also ask the reviewer to look at a specific script or code file, and ask questions such as:
Info
For further ideas about useful peer review activities, and how to incorporate them into your workflow, see the following paper:
Implementing code review in the scientific workflow: Insights from ecology and evolutionary biology, Ivimey-Cook et al., Journal of Evolutionary Biology 36(10):1347\u20131356, 2023.
"},{"location":"community/meetings/2023-10-18/#styling-and-formatting","title":"Styling and formatting","text":"We also discussed opinions about how to name functions, variables, files, etc.
For example, R allows you to use periods (.
) in function and variable names, but the tidyverse style guide recommends only using lowercase letters, numbers, and underscores (_
).
If you review other people's code, and have other people review your code, you might be surprised by the different styles and conventions that people use. When reviewing code, these differences can be somewhat distracting.
Agreeing on, and adhering to, a common style guide can avoid these issues and allow the reviewer to dedicate their attention to actually reading and reasoning about the code.
There are tools to automatically format your code (\"formatters\") and to warn about potential issues, such as unused variables (\"linters\"). Here are some commonly-used formatters and linters for different languages:
Language Style guide(s) Formatter Linter R tidyverse styler lintr Python PEP 8 / The Hitchhiker's Style Guide black ruff Julia style guide JuliaFormatter.jl Lint.jlThere are AI tools that you can use to write, format, and review code, but you will need to check whether the code is correct. For example, GitHub Copilot is a (commercial) tool that accepts natural-language descriptions and generates computer code.
Tip
Feel free to use AI tools as a way to get started, but don't simply copy-and-paste the code they give you without reviewing it.
"},{"location":"community/meetings/2024-02-19/","title":"19 February 2024","text":"In this meeting we asked participants to suggest goals and activities to achieve in 2024.
Note
If you were unable to attend the meeting, you can send us suggestions via email.
We have identified the following goals for 2024:
Develop orientation materials for new students and staff;
Share examples of project repositories and model implementations;
Build expertise in testing your own code;
Use peer code review to share knowledge and develop coding skills;
See the sections below for further details.
"},{"location":"community/meetings/2024-02-19/#orientation-materials","title":"Orientation materials","text":"The first suggestion was to develop orientation materials for new students, postdocs, people coming from wet-lab backgrounds, etc. Suggested topics included:
Info
Some of these topics are already covered in Git is my lab book; see the links above.
There was broad interest in having a checklist, and example workflows for people to follow \u2014 particularly for projects that involve some form of code \"hand over\", to ensure that the recipients experience few problems in running the code themselves.
We aim to address these topics in Git is my lab book, with assistance and feedback from the community. See the How to contribute page for details.
"},{"location":"community/meetings/2024-02-19/#example-projects-and-model-templates","title":"Example projects and model templates","text":"Building on the idea of orientation materials, a number of participants suggested providing example projects and different types of models.
The most commonly used languages in our community are:
As an example, we could demonstrate how to write an age-stratified SEIR ODE model in R and Python, and how to write agent-based models in vector and object-oriented forms.
Info
GitHub allows you to create template repositories, which might be a useful way of sharing such examples. We could host these template repositories in our Community of Practice GitHub organisation.
"},{"location":"community/meetings/2024-02-19/#how-and-why-to-begin-testing-your-code","title":"How (and why!) to begin testing your code","text":"We asked everyone whether they'd ever found bugs in their code, and were relieved to see that yes, all of us have made mistakes! Writing test cases in one way to check that your code is behaving in the way that you expect.
But it can be hard to figure out how to actually write useful tests.
You can make your code easier to test if you structure your code well, and consider how to test it when you start coding.
As an example, Cam mentioned that he had written a stochastic model of a hospital ward, in which there was a queue of tasks. At the end of a shift, some tasks may not have been done, and these are put back on the queue for the next shift. Cam discovered there was a bug in the way this was done, and fixed it. However, later on he reintroduced the same bug. This is precisely the situation where regression tests are useful. In brief:
When you discover a bug in your code, write a test that detects this bug before you fix it.
You now have a test that will identify this bug if you ever make the same mistake again.
But you need to run this test whenever you modify your code.
Continuous Integration (CI) is one way to run tests automatically, whenever you push a commit to a platform such as GitHub or GitLab. See the list of resources shared in our previous meeting for some examples of using CI.
In our community we have a number of people with familiarity and expertise in testing infectious disease models and related code. We need to share this knowledge and help others in the community to learn how to test their code.
"},{"location":"community/meetings/2024-02-19/#peer-code-review","title":"Peer code review","text":"We talked about how we improve our writing skills by receiving feedback on drafts from colleagues, supervisors, etc, and how a similar approach can be extremely useful for improving our coding skills.
Info
A goal for this year is to review each other's code! Note that we have developed some guidelines for peer code review.
Suggestions included:
Before submitting a paper to a journal, ask someone else to run your code.
Rob has been working on a within-host malaria model, which is written in R and uses Continuous Integration to generate R Markdown documents every time the code is updated. He is happy to share this for code review, so that:
Everyone can see a working example of continuous integration; and
He can receive feedback on the code, since he is not an experienced R programmer.
A number of participants expressed a willingness to participate in peer code review.
Angus noted that it can be difficult to identify a discrete chunk of digestible code to offer up for peer review.
We can coordinate peer code review through our Community of Practice GitHub organisation.
"},{"location":"community/meetings/2024-02-19/#sharing-code-but-not-the-original-data","title":"Sharing code but not the original data","text":"Samik mentioned that in a recent paper, The impact of Covid-19 vaccination in Aotearoa New Zealand: A modelling study, the code is provided in a public GitHub repository, but that they do not have permission to share the original health data.
Info
We can frequently encounter this issue when working with public health and clinical data sets.
What are the best practices in this case?
One option is to generate dummy data by, e.g., resampling or adding noise to the original data. You could then inform the reader that they should obtain similar, but not necessarily identical results.
You can also use a checksum algorithm (such as SHA-2) and include the checksum for each original data file in the public repository. This would allow users who can obtain access to the original data to verify that they are using identical data.
In this meeting we asked participants to suggest specific tools, templates, packages, etc, that we should include in our Orientation guide. We used the high-level topics proposed in our previous meeting as a starting point.
Attendance: 7 in person, 9 online.
Git is my lab book updates
We have switched to Material for MkDocs, which greatly improves the user experience.
For example, use the search bar above (press F) to interactively search the entire website.
"},{"location":"community/meetings/2024-04-11/#purpose-of-the-guide","title":"Purpose of the guide","text":"We are aiming to keep the orientation guide short and simple, to avoid overwhelming newcomers.
James Ong: If we can agree on a structure, we can then get people to contribute to specific sections.
TK Le: schedule a one-hour meeting where everyone works on writing content for 30 minutes, and then reviews each others' content for 30 minutes?
"},{"location":"community/meetings/2024-04-11/#project-organisation","title":"Project organisation","text":"Key message
A project's structure may need to change over time, and that's fine. What matters is that the structure is explained.
A common theme raised by a number of participants was deciding how to organise your files, dividing projects into cohesive parts, and explaining the relationships between these parts (i.e., how they interact or come together).
Useful tools mentioned in this conversation included:
Using git to track your files;
Using tools such as the here
package for R to locate your files without resorting to hard-coded paths or changing the working directory;
Defining your project's workflow with a pipeline tool such as the targets
package for R.
Info
We are drafting topical guides about these topics. See the online previews for the following guides:
If you have any suggestions or feedback, please let us know in the pull request!
"},{"location":"community/meetings/2024-04-11/#working-collaboratively","title":"Working collaboratively","text":"Key message
Plan to work collaboratively from the outset. It is highly likely that someone else will end up using your code.
Nick Tierney: you are always collaborating with your future self!
One concern raised was how best to prepare your code for handover.
Pan: You need to think about it from the beginning. There will be more and more people trying to use existing models. I am writing a guideline about vaccination modelling, and referring to readers as the \"model user\" (developers, modellers, end users). If we plan for others to use our model, we need to develop the model in a way that aims to make it easier for people to use.
Reminder
We have developed a topical guide on how to use git for collaborative research.
"},{"location":"community/meetings/2024-04-11/#reviewing-code-and-project-structure","title":"Reviewing code and project structure","text":"Key message
Feedback from other people can be extremely useful to identify sources of confusion.
The earlier that someone can review your code and project structure, the easier it should be to act on their feedback.
Saras Windecker mentioned that the Infectious Disease Ecology and Modelling (IDEM) team organised code review sessions that the team found really informative, but also reminded everyone how hard it is to have guidelines that are consistent and broadly useful.
Question: was the purpose of these sessions to review code, or to review the project structure?
They were intended to review code, but team members found they had to review the project structure before they could begin to understand and improve the code.
Question: What materials, inputs, resources, etc, can we provide people who are dealing with messy code?
Rob Moss reflected on his experience of picking up a within-host malaria RBC spleen model and how difficult it was to identify which parts of the model were complete and which needed further work. He gradually divided the code into separate files, and regularly ran model simulations to verify that the outputs were consistent with the original code.
Info
Rob is happy to share the original model code, discuss how it was difficult to understand, and to walk through how he restructured and documented it. If you're interested, send him an email.
"},{"location":"community/meetings/2024-04-11/#how-to-structure-your-data","title":"How to structure your data","text":"Key message
If data are shared, they often lack the documentation to make them easy to reuse.
Nick Tierney asked if anyone had thoughts on how to structure their data. Consistent with our earlier discussion, he suggested that one of the most important things is to have a README
that explains the project structure. He then shared one of his recent papers Common-sense approaches to sharing tabular data alongside publication.
Question: do you advocate for data to be tidied (\"long\"), etc?
Key message
There are various ways to manage confidential data, each with pros and cons.
Michael Plank asked for ideas about how to manage confidential data when working with collaborators, to ensure that everyone is using the most recent version of the data. Obviously you don't want to commit the data files in your project repository, so the data files should be listed in your .gitignore
file.
One option is to store the confidential data in a separate git repository, with tightly controlled access permissions. You can keep a local copy in a separate directory. If you create a symbolic link to this directory inside your project repository, and add this symbolic link to your .gitignore
file, you can still use tools such as here
to access the data.
A second option, which was used by a number of teams that worked on COVID-19 projects for the Australian Department of Health, was to host the data on a secure data management platform (mediaflux). Every time new data were received, the data management groups would perform various quality checks and generate analysis-ready data files. They would then notify all of the teams, who would each download the latest data files as part of their computational pipeline.
The most suitable solution probably depends on a combination of factors, including:
Key message
Debugging is an important skill, but good coding practices are just as important: they make your code easier to test and debug.
A number of people suggested that the orientation guide should provide some information about how to debug your code.
Nick Tierney: I could go on a long rant about debugging, and why we should be teaching how to divide code into small functions that are easier to test!
We also discussed that there are various ways to debug your code, from printing debugging messages, to using an interactive debugger, to writing test cases that explain how the code should work.
Rob Moss: I've used regression tests and continuous integration (CI) to detect when I accidentally change the outputs of my code/model. For example, my SMC package for Python includes test cases that record simulation outputs, and after running the tests I check that the output files remain unchanged.
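As a rough illustration of this kind of regression test, the sketch below compares fresh model outputs against a previously recorded snapshot. The run_model.py script, its --output flag, and the file paths are hypothetical — they are not part of Rob's package.

```python
# A minimal pytest sketch of an output-snapshot regression test.
from pathlib import Path
import subprocess


def test_simulation_outputs_unchanged(tmp_path):
    output_file = tmp_path / "simulation_outputs.csv"
    # Run the (hypothetical) model script and write outputs to a temp dir.
    subprocess.run(
        ["python", "run_model.py", "--output", str(output_file)],
        check=True,
    )
    # Compare against a snapshot recorded when the outputs were known to be
    # correct (and committed to version control).
    expected = Path("tests/expected/simulation_outputs.csv").read_text()
    assert output_file.read_text() == expected
```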
"},{"location":"community/meetings/2024-04-11/#guidelines-for-using-ai","title":"Guidelines for using AI","text":"Key message
Practices such as code review and testing are even more important for code that is created using generative AI.
Pan: speaking for junior students, a lot of students are now using ChatGPT for their coding, either to create the initial structure or to transform code from one programming language to another.
Question: Can we provide useful guidelines for those students?
James: this probably isn't something we will cover in the orientation guide. But perhaps we need some guidelines for generative AI use in the topical guides.
Testing your code and ensuring it is reproducible is even more important when using ChatGPT. We know it doesn't always give you correct code, so how can you decide whether what it's given you is useful? It would be great to have an example of code generated by ChatGPT that is incorrect or unnecessarily difficult to understand, and to show how we can improve that code.
A question for the community
Does anyone have any examples of code produced by ChatGPT that didn't function as intended?
"},{"location":"community/meetings/2024-04-11/#useful-resources","title":"Useful resources","text":"The following resources were mentioned in this meeting:
Developing a modern data workflow for regularly updated data;
A computational pipeline for the paper Mapping trends in insecticide resistance phenotypes in African malaria vectors;
Common-sense approaches to sharing tabular data alongside publication;
The here
package for R; and
The targets
package for R.
In this meeting TK Le gave a presentation about a series of COVID-19 modelling projects and how their experiences were impacted by the choice of programming languages, model structure, editors, tools, etc.
Attendance: 4 in person, 4 online.
Info
We welcome presentations about research projects and experiences that relate in any way to reproducibility and good computational research practices. Presentations can be short, informal, and free-form.
Please contact us if you have anything you might like to present!
"},{"location":"community/meetings/2024-05-09/#three-projects-four-models","title":"Three projects, four models","text":"This work began in 2022, and was based on code that was originally written by Eamon and Camelia. TK adapted the code to run new scenarios across three separate modelling projects:
The workflow was divided into a series of distinct models:
An immunological model written in R and greta;
An agent-based model of population transmission written in C++ (post-processing written in R);
A clinical pathways model implemented in MATLAB; and
A cost-effectiveness model implemented in R.
TK's primary activities involved implementing different vaccine schedules in the transmission model, and visualising outputs from the transmission model and the clinical pathways model; they implemented all of this work in Python.
"},{"location":"community/meetings/2024-05-09/#the-multi-model-structure","title":"The multi-model structure","text":"Key message
There isn't necessarily a single best way to structure a large project.
Question: Was it a benefit to have separate model implementations in different languages, with clearly defined data flows from one model to the next?
Conceptually yes, but this structure also involved a lot of trade-offs — even the sheer volume of data that had to be saved by one model and read into another. It was difficult to pick up and use the code as a whole. And there were related issues, such as difficulties installing greta.
Nick Tierney: I know that greta can be difficult to install, and I can help people with this.
TK also noted that they found some minor inconsistencies between the models, such as whether a date was used to identify the start or end of its 24-hour interval.
"},{"location":"community/meetings/2024-05-09/#tools-and-platforms","title":"Tools and platforms","text":"Key message
Personal preferences can play a major role in deciding which tools are best for a project.
The various models were hosted in repositories on Bitbucket, GitHub, and the University of Melbourne's GitLab instance. TK noted that the only discernible differences between these platforms were how they handled authorisation and authentication.
TK also explored several different editors, some of which were language-specific:
TK noted they had previous experience with Eclipse (an IDE for Java) and Visual Studio (which felt very \"heavy\").
Question: what were your favourite things in VS Code, and what made RStudio the worst?
It was easiest to open a project in VS Code; RStudio would always open an entire project or previous workspace, rather than just opening a single file. RStudio also kept asking to update itself.
TK also strongly disliked the RStudio font, which was another source of friction. They tried installing an RStudio extension for VS Code, but weren't sure how well it worked.
Nick Tierney: R history is really annoying, but you can turn it off. I'm not sure why it's enabled by default; all of the RStudio recommendations involve turning off these kinds of things.
Rahmat Sagara: I'm potentially interested in using VS Code instead of RStudio.
Eamon Conway: the worst thing about VS Code is that debugging is very hard to set up.
"},{"location":"community/meetings/2024-05-09/#task-management","title":"Task management","text":"Key message
Task management is hard, and switching to a new system during a project is extremely difficult.
TK reported trying to use GitLab issues to plan out what to do and how to do it, but found they weren't a good fit with their workflow. They then trialled Trello boards for different versions, but stopped using them due to a lack of time to manage the boards. In review:
Rob Moss: we know that behaviour changes are hard to achieve, so it's not surprising that a large change was challenging to maintain \u2014 ideally we would make small, incremental changes, but this isn't always possible or useful.
Eamon Conway: I like the general idea of using task-tracking software, but I've settled on only using paper. It's always with me, it's always at home, and it's physically under my keyboard!
Ruarai Tobin: I use Notion with a single large Markdown file; you can paste screenshots into it.
"},{"location":"community/meetings/2024-05-09/#repository-structure","title":"Repository structure","text":"Key message
There are many factors and competing concerns that influence the repository structure.
The repository structure changed a lot across these three projects.
In the beginning, the main challenge was separating out the different parts. While this was achieved, it wasn't immediately obvious where a user was supposed to start \u2014 the file structure did not make it clear. The README.md
file did, however, include an explanation.
By the final project, the repository was divided into a number of directories, each of which was given a numeric prefix (e.g., 0_data
and 4_post_processing
). However, this was also a little misleading:
In order to run the code you start in the folder numbered 1;
The post processing in 4_post_processing
was interleaved between some of the other steps (it mostly contains plotting and visualisation code, but also some other stuff);
The structure also had to differ between running the code on Monash University's MASSIVE HPC platform, and on the University of Melbourne's Spartan HPC platform.
Question: is there an automated pipeline?
TK replied that the user had to run the code in each of the numbered folders in the correct (ascending) order, and that they wanted to automate the dependent jobs on Spartan.
Eamon Conway: if you do ever automate it, we should share it with people (e.g., this community) because people may be able to learn from it when they want to use Spartan. I know how to use Slurm for job management and can help you automate it.
"},{"location":"community/meetings/2024-05-09/#data-visualisation","title":"Data visualisation","text":"Key message
Producing good figures takes a lot of time, thought, and experimentation, and also a lot of coding.
It was extremely hard to decide what to show, and how to show it, in order to highlight key messages.
It was very easy to overwhelm the viewer with complicated plots and massive legends. For example, the scenarios involved three epidemic waves, and how can you show relationships between each pair of waves? It is relatively simple to build a 3D plot that shows these relationships, but the viewer can't really interpret the results.
"},{"location":"community/meetings/2024-05-09/#other-activities","title":"Other activities","text":"Key message
Following better practices would have required maybe 50% more time, but there wasn't funding \u2014 who will pay for this?
Dedicating time to other activities was not feasible — no one had time: these projects had fixed deadlines, and it was challenging to complete the work within them.
As explained above, data visualisation took longer than expected. And sometimes the code simply wouldn't run on high-performance computing platforms. For example, sometimes MATLAB just wouldn't load; there were intermittent failures for no apparent reason and with no useful error messages.
Activities that would have been nice to do, but were not undertaken, included:
Rob Moss: we're very unlikely to get funding to explicitly cover these activities. We need to allocate sufficient time in our budgets, as best we can. Practising these skills on smaller scales can also help us to use them with less overhead in larger projects.
"},{"location":"community/meetings/2024-05-09/#version-control-and-publishing","title":"Version control and publishing","text":"Key message
This can be challenging even with all of the tools and infrastructure in place.
Question: were all of the projects wrapped up into one central repository?
No, they're all separate. The first project was provided as a zip file attached to the paper. The second project is in a public git repository. The final project is ongoing and remains confidential; it is stored in a private repository on the University of Melbourne's GitLab instance.
Question: did the latest project build on the previous ones?
Yes, and this led to managing current and older versions of the code. For example, TK found a bug that caused a minor discrepancy between times reported in two different models (see The multi-model structure), but it wasn't possible to update the older code and regenerate the associated outputs.
Question: should we use git (and GitHub) only for publication, or during the work itself?
Eamon Conway: Use it from the beginning to track your work, and maybe have different privacy settings (confidential code and/or data).
Rob Moss: you can use a git repository for your own purposes during the work, and upload a snapshot to Figshare or Zenodo to obtain a DOI that you can cite in your paper.
"},{"location":"community/meetings/2024-05-09/#broader-conclusions","title":"Broader conclusions","text":"Changing our behaviour and work habits is hard, and so is finding time to develop these skills. We need to practice these skills on small problems first, rather than on large projects (and definitely not when there are tight deadlines).
A question for the community
Should we organise an event to practice and develop these skills on small-scale problems?
"},{"location":"community/meetings/2024-06-13/","title":"13 June 2024","text":""},{"location":"community/meetings/2024-06-13/#cam-zachreson-a-comparison-of-three-abms","title":"Cam Zachreson: A comparison of three ABMs","text":"In this meeting Cam gave a presentation about the relative merits and trade-offs of three different approached for agent-based models (ABMs).
Attendance: 7 in person, 13 online.
"},{"location":"community/meetings/2024-06-13/#theoretical-frameworks","title":"Theoretical frameworks","text":"Key message
Each framework is built upon different assumptions about space, contacts, and transmission.
Cam introduced three theoretical frameworks for disease transmission, which he used in constructing infectious disease models for three different projects. Note that all three models use the same within-host model for individual dynamics.
Border quarantine for COVID-19: international arrivals, quarantine workers, and the local community are divided into mixing groups within which there is close contact. There is also weaker contact between these mixing groups.
Social isolation in residential aged care facilities: transmission is represented as a multigraph that explicitly simulates contact between individuals. The graph is dynamic: workers and worker-room assignments can change every day. Workers share N edges when they service N rooms in common.
A single hospital ward (work in progress): a shared space model represents spatial structure as a network of separate spaces (i.e., nodes). Nurses and patients are associated with spaces according to schedules. Each space has its own viral concentration, driven by shedding from infectious people and ventilation (the air changes around 6 times per hour). Residence in a space results in a net viral dose, which confers a probability of infection (using the Wells-Riley model).
Question
Are many short interactions equivalent to one long interaction?
"},{"location":"community/meetings/2024-06-13/#pros-and-cons-of-model-structures","title":"Pros and cons of model structures","text":"Key message
Each framework has unique strengths and weaknesses.
As shown in the slide below, Cam identified a range of pros and cons for each modelling framework. Some of the key trade-offs between these frameworks are:
The ease of validation (aged care and hospital ward) versus the ease of communication (quarantine);
Having explicit physical parameters and units (hospital ward) versus having vague and/or phenomenological parameters (quarantine and aged care); and
Being simple to construct and efficient to run (quarantine and aged care) versus being complex to construct and computationally expensive (hospital ward).
Key message
Complex models typically have complex data requirements.
Data requirements can also present a challenge when constructing complex models. For example, behaviour models are good for highly-structured environments such as hospital wards, where nurses have scheduled tasks that are performed in specific spaces. However, the data required to construct the behaviour model can be very hard to collect, access, and use. Even if nurses wear sensors, the sensor data are never sufficiently clean or complete to use without substantial cleaning and processing.
Airflow between spaces in a highly-structured environment is also complex to model. Air exchange can involve diffusion between adjacent spaces, but also airflow between non-adjacent spaces through ventilation systems. These flows can be difficult to identify, and are computationally expensive to simulate (computational fluid dynamics).
Cam concluded by observing that existing hospital wards tend to have a design flaw for infection control:
There are many shared spaces in which infection can spread among individuals via air transmission.
"},{"location":"community/meetings/2024-06-13/#reproducibility-in-stochastic-models","title":"Reproducibility in stochastic models","text":"Key message
These models rely on random number generators (RNGs), whose outputs can be controlled by defining their initial seed. Using a separate RNG for each process in the model provides further advantages (see below).
In contrast to agent-based models of much larger populations, these models are small enough that they can be run as single-threaded code, and multiple simulations can be run in parallel. The bulk of the computational cost usually comes from sweeping over many combinations of parameter values.
The aged care (multigraph) and hospital ward (shared space) models decouple the population RNG from the transmission dynamics RNG. An advantage of using multiple RNGs is that we can independently control and modify these processes. For example, by using separate RNGs for infections and testing, we can adjust testing practices without affecting the infection process.
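The sketch below illustrates this idea of decoupled random number streams; it is a minimal, illustrative example rather than Cam's code, and the process names and seeds are assumptions.

```python
# Give each model process its own RNG, so that changing one process
# (e.g., testing) leaves the other processes' random streams untouched.
import numpy as np

infection_rng = np.random.default_rng(seed=20240613)
testing_rng = np.random.default_rng(seed=1)


def draw_infections(num_contacts, prob_transmission):
    # Uses only the infection RNG.
    return infection_rng.random(num_contacts) < prob_transmission


def draw_test_results(num_cases, test_sensitivity):
    # Uses only the testing RNG; adjusting testing practices or seeds
    # does not perturb the infection process above.
    return testing_rng.random(num_cases) < test_sensitivity
```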
"},{"location":"community/meetings/2024-06-13/#topic-for-a-masters-project","title":"Topic for a Masters project","text":"Question
Does anyone know a suitable Masters student?
Cam is looking for a Masters student to undertake a project that will look at individual-level counterfactual scenarios. The key idea is to identify sets of preconditions (e.g., salient details of the event history and/or current epidemic context) and ensure that the model will always generate the same outcome when given these preconditions. An open question is how far back in the event history is necessary/sufficient.
"},{"location":"community/meetings/2024-07-11/","title":"11 July 2094","text":""},{"location":"community/meetings/2024-07-11/#nefel-tellioglu-lessons-learned-from-pneumococcal-vaccine-modelling","title":"Nefel Tellioglu: Lessons learned from pneumococcal vaccine modelling","text":"In this meeting Nefel gave a presentation about a pneumococcal vaccine (PCV) evaluation project for government, sharing her experiences in developing a model from scratch under tight deadlines.
Attendance: 6 in person, 6 online.
Info
We welcome presentations about research projects and experiences that relate in any way to reproducibility and good computational research practices. Presentations can be short, informal, and free-form.
Please contact us if you have anything you might like to present!
"},{"location":"community/meetings/2024-07-11/#computational-performance","title":"Computational performance","text":"Key message
Optimisation is a skill that takes time to learn.
This project involved constructing an agent-based model (ABM) of pneumococcal disease, incorporating various vaccination assumptions and intervention strategies. Nefel was familiar with an existing ABM framework written in Python, but found that the project requirements (a large population size and long simulation time-frames) meant that a different approach was required.
Asking for help in a new skill: model optimisation for each vaccine type and multi-strains
They ended up implementing a model from scratch, using the Polars data frame library to represent each individual as a separate row in a single population data frame. This library is designed for high performance, and Nefel was able to implement a model that ran quickly enough for the purposes of this project.
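A small sketch of this row-per-individual approach with Polars is shown below; the column names and eligibility rule are illustrative, not taken from Nefel's model.

```python
import polars as pl

# Each row represents one individual in the population.
population = pl.DataFrame({
    "age": [2, 15, 34, 67, 80],
    "vaccinated": [False, False, True, False, True],
})

# Vectorised, whole-of-population updates avoid slow per-agent Python loops.
population = population.with_columns(
    ((pl.col("age") >= 65) & ~pl.col("vaccinated")).alias("eligible_for_dose")
)
print(population)
```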
An introduction to Polars workshop?
Nefel asked whether other people would be interested in an \"Introduction to Polars\" workshop, and a number of participants indicated interest.
"},{"location":"community/meetings/2024-07-11/#workflows-and-deadlines","title":"Workflows and deadlines","text":"Key message
Using version control makes it much easier to fix your code when it breaks.
Nefel made frequent use of a git repository (hosted on GitHub) in the early stages of the project. She found it very useful during the model prototyping phase, when adding new features frequently broke the code in some way. Having immediate access to previous versions of the code made it much easier to revert changes and fix the code.
However, she stopped using it when the project reached a series of tight deadlines.
"},{"location":"community/meetings/2024-07-11/#asking-for-extensions","title":"Asking for extensions","text":"Key message
Being able to provide advance warning of potential delays, and to explain the reasons why they might occur, is extremely helpful for everyone. This allows project leaders and stakeholders to adjust their plans and expectations.
It's generally hard to estimate feasible timelines in advance. This is especially difficult when exploring a new problem, and when a range of future extensions are anticipated.
These kinds of conversations can feel extremely uncomfortable. Several participants reflected on their own experiences, and agreed that informing their supervisors about potential problems as early as possible was the best approach.
Things can take longer than expected due to the research nature of building a new infectious disease model. Where possible, avoid promising that a model will be completed by a certain time. Instead, give stakeholders regular updates about progress and challenges, so that they can appreciate how much effort is being applied to the problem.
Gizem: stakeholders may not know what they want or need from the model. It's really helpful to clarify this early in the project, which needs a good working relationship.
Eamon: writing your code in a modular way can help make it easier to implement those future extensions. Experience also helps in designing your code so that future extensions only modify small parts of your model. But avoid trying to make your code as abstract and extensible as possible.
Rob: if you know that the model will be applied to many different scenarios in the future, try to separate the code that defines the location of data files from the code that uses those data. That makes it easier to run your model using different sets of input data.
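As a minimal sketch of Rob's suggestion, the example below keeps all knowledge about input-data locations in one place; the scenario names, file paths, and placeholder model function are purely illustrative.

```python
from pathlib import Path

# All knowledge about where the input data live is kept in one place ...
SCENARIO_INPUTS = {
    "baseline": Path("data/baseline/cases.csv"),
    "high_uptake": Path("data/high_uptake/cases.csv"),
}


def run_scenario(name):
    # ... while the modelling code only ever receives a file path, so running
    # the model on a new data set does not require changing this function.
    data_file = SCENARIO_INPUTS[name]
    with data_file.open() as f:
        return sum(1 for _ in f)  # Placeholder for the real model run.
```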
"},{"location":"community/meetings/2024-07-11/#related-libraries-for-python-and-r","title":"Related libraries for Python and R","text":"Key message
There are a number of high-performance data frame libraries.
Polars primarily supports Python, Rust, and JavaScript. There is also an R package that has several extensions, including:
polarssql: a Polars backend for DBI and dbplyr; and
tidypolars: tidyverse syntax for Polars.
Other high-performance data frame options for R:
data.table: a high-performance data.frame
replacement;
DBI: a package for working with various databases; and
dbplyr: a database backend for dplyr.
DuckDB is another high-performance library for working with databases and tabular data, and is available for many languages including R, Python, and Julia. It also integrates with Polars, allowing you to query Polars data frames and to save outputs as Polars data frames.
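For example, DuckDB's Python API can query a Polars data frame directly and return the result as a Polars data frame; this brief sketch uses illustrative table contents.

```python
import duckdb
import polars as pl

cases = pl.DataFrame({
    "region": ["A", "A", "B"],
    "n": [3, 5, 2],
})

# DuckDB scans the local Polars data frame and returns Polars output.
totals = duckdb.sql(
    "SELECT region, SUM(n) AS total FROM cases GROUP BY region"
).pl()
print(totals)
```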
"},{"location":"community/meetings/2024-07-11/#conclusions","title":"Conclusions","text":"Key message
Once a project is completed, it's worth reflecting on what worked well, and on what you would do differently next time.
Nefel finished by reflecting on what she might do differently next time, and highlighting two key points:
Begin with a clearer understanding of the skills required for the project, such as modelling large populations and code optimisation.
Where there are potential skill gaps, involve other people in the project who can contribute relevant expertise.
At our next meeting \u2014 currently scheduled for Thursday 8 August \u2014 we will work on finalising our Orientation Guide checklist, collect supporting materials for each item on the checklist, and begin drafting content where no suitable supporting materials can be found.
"},{"location":"community/meetings/2024-08-08/","title":"8 August 2024","text":""},{"location":"community/meetings/2024-08-08/#orientation-guide","title":"Orientation guide","text":"Key message
The aim for the Orientation Guide is to provide a short overview of important concepts, useful tools and resources, and links to relevant tutorials and examples.
In this meeting we discussed how the Orientation Guide could best address the needs of new students and staff. We began by asking participants what skills, tools, and knowledge they've found to be particularly useful, and wish they'd discovered earlier.
Attendance: 5 in person, 2 online.
"},{"location":"community/meetings/2024-08-08/#core-tools-and-recommended-packages","title":"Core tools and recommended packages","text":"Key message
There was strong interest in having opinionated recommendations for helpful software packages and libraries, based on our own experiences.
When we start out, we typically don't know what tools are available and how to choose between them. So having guidance and recommendations from more experienced members of our community can be valuable.
This harks back to TK's presentation and their reflections on choosing the best tools for a specific project or task. For example:
Question
Which editor should a new student use for their project?
We strongly recommend choosing an editor that can automatically format and check your code.
Eamon suggested that in addition to linking to tutorials for installing common tools such as Python and R, the orientation guide should recommend helpful packages. For example:
For R, the tidyr
package and the broader collection of \"tidyverse\" packages.
For Python, Conda is probably the easiest way to install Python and scientific/numeric Python libraries on Windows and OS X (it also supports Linux).
Jacob: it would be nice to have a flowchart or diagram to help identify relevant tools and packages. For example, if you want to (a) analyse tabular data; and (b) use Python; then what package would you recommend? (Our answer was Polars).
"},{"location":"community/meetings/2024-08-08/#reproducible-environments","title":"Reproducible environments","text":"Key message
Virtual environments allow you to install the packages that are required for a specific project, without affecting other projects.
This is useful for a number of reasons, including:
Being able to manage each project independently and in isolation;
Being able to use different versions of packages in different projects; and
Making it simpler to set up and run a project on a new computer.
Python provides built-in tools for virtual environments, and the University of Melbourne's Python Digital Skills Training includes a workshop on Python Virtual Environments.
For R, the renv
package provides similar capabilities, and the Introduction to renv article provides a good overview.
Key message
Reproducible document formats such as RMarkdown (for R) and Jupyter Notebooks (for Python) provide a way to combine code, data, text, and figures in self-contained and reproducible plain-text files.
For introductions and examples, see:
Nick Tierney's RMarkdown for Scientists;
The Jupyter Notebook documentation; and
The Quarto publishing system is similar to RMarkdown, but allows you to write code in Python, R, and/or Julia, and supports many different output formats.
If you use VS Code to write Quarto documents, editing a Python code block will open it in Jupyter, which allows you to step through and debug the code to some degree.
"},{"location":"community/meetings/2024-08-08/#existing-training-courses-and-needs","title":"Existing training courses and needs","text":"Key message
There will be an Introduction to Polars workshop at the SPECTRUM 2024 annual meeting (23-25 September), led by Nefel Tellioglu and Julian Carlin.
We asked participants if they had found any training materials that were particularly useful.
Mackrina said that she is using Python in her PhD project, but previously only had experience with Matlab.
Mackrina completed several online Python courses, but found that an in-person training session offered by the University of Melbourne's Digital Skills Training was the most useful. They regularly run a wide range of training sessions; see the list of upcoming sessions.
Mackrina found that the pypfilt package made it much easier to write ordinary differential equation (ODE) models and run scenario simulations. Note: Rob is the pypfilt
author and maintainer. This package is designed for scenario modelling, and model-fitting using Sequential Monte Carlo (SMC) methods.
Other participants chimed in with recommended resources and training needs:
Rob: The Software Carpentry provides a good range of lessons.
TK: I want to learn how to use Docker and containers, and how to (install and) use greta.
Eamon: I'm happy to provide assistance and guidance with using Stan to fit models using Hamiltonian Monte Carlo.
Jiahao: Python ODE solvers are not described nearly as well as Matlab's ODE solvers, so they are harder to use.
Using GPGPUs for high-performance computing
Jiahao asked: Does anyone in our community have experience with using GPGPUs?
In response to Jiahao's question, Eamon replied that he has found it to be near-impossible, due to a combination of:
This initiated a broader discussion about improving the computational performance of our code and making use of high-performance computing (HPC) resources.
Computational performance was an issue that Nefel encountered when constructing an agent-based model of pneumococcal disease, and she found that code optimisation is a skill that takes time to learn.
We discussed several ways of using multiple CPU cores to make code run more quickly:
Using libraries that automatically make use of multiple CPU cores, such as future.apply for R, concurrent.futures for Python, and the Polars data-frame library.
Where we want to run large numbers of simulations, the easiest approach can often be for each simulation to only use one CPU core, and to run many simulations in parallel (e.g., on virtual machines that have many CPU cores); a minimal sketch of this approach is shown after this list. However, as TK pointed out, if each simulation uses a large amount of RAM, it may not be possible to run many simulations in parallel with this approach.
For larger scale problems, there are HPC platforms such as the University of Melbourne's Spartan and Monash University's MASSIVE. Using these platforms typically requires writing some bespoke code to define and schedule jobs.
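The sketch below shows the "many independent simulations in parallel" approach using concurrent.futures; the simulate() function and its parameter values are hypothetical.

```python
from concurrent.futures import ProcessPoolExecutor


def simulate(beta):
    # Placeholder for a single-threaded model run.
    return {"beta": beta, "attack_rate": beta / (1.0 + beta)}


def main():
    betas = [0.5, 1.0, 1.5, 2.0]
    # Each simulation uses a single CPU core; the pool runs them in parallel.
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(simulate, betas))
    print(results)


if __name__ == "__main__":
    main()
```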
Key message
There was strong interest in running a debugging workshop at the upcoming SPECTRUM 2024 annual meeting (23-25 September). As TK and Nefel have shown in their presentations, skills like debugging are extremely valuable for projects with tight deadlines, but these projects are also the worst time in which to develop and practice these skills.
Info
Attendees confirmed their willingness to evaluate and provide feedback on workshop draft materials.
Rob reflected that many people struggle to effectively debug their code, and can end up wasting a lot of time. Since we all make mistakes when writing code, this can be an extremely valuable skill. This is particularly true when working on, e.g., modelling contracts with government (see, e.g., the recent presentations from TK and Nefel).
We discussed some general guidelines, such as:
Structuring your code so that it's easier to debug (e.g., small functions);
Avoiding hard-coded numerical values, file names, etc., as much as possible;
Making use of breakpoints rather than print()
statements;
Checking that input arguments to a function satisfy necessary conditions;
Checking that output values from a function satisfy necessary conditions;
Failing early (e.g., raising exceptions in Python, calling stop()
in R) rather than returning values that may not be valid.
Eamon: by learning how to debug code, I substantially improved how I write and modularise my code. My functions became smaller, and this helped me to make fewer mistakes.
For example, David Price and I encountered a bug in some C++ code where the function was correct, but made assumptions about the input arguments. These assumptions were initially satisfied, but as other parts of the code were updated, these assumptions were no longer true.
To address this, I often write if
statements at the top of a function to check these kinds of conditions, and stop if there are failures. You can see examples of this in real-world code from popular packages.
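A minimal sketch of this pattern is shown below; the function and its conditions are illustrative. The idea is to check the assumptions at the top of the function and stop immediately if they do not hold.

```python
def mean_daily_incidence(case_counts, num_days):
    # Fail early if the arguments violate the function's assumptions.
    if num_days <= 0:
        raise ValueError(f"num_days must be positive, got {num_days}")
    if len(case_counts) != num_days:
        raise ValueError("case_counts must contain one value per day")
    return sum(case_counts) / num_days
```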
James: I'm happy to provide an example of debugging within an R pipe. Learn Debugging might be a useful resource for our community.
Rob: failing early is good, rather than producing output and then having to discover that it's incorrect (which may not be obvious). Related skills include learning how to read a stack trace, and defensive programming (such as checking input arguments, as Eamon mentioned).
TK: it's really hard to change existing habits. And I'm not doing any coding in my projects right now. My most recent coding experiences were in COVID-19 projects (see TK's presentation) and the very tight deadlines didn't allow me the opportunity to develop and apply new skills.
Rob: everyone already debugs and tests their code to some degree, simply by writing and evaluating code line by line (e.g., in an interactive R or Python session) and by running functions with example arguments to check that they give sensible outputs. We just need to \"nudge\" these behaviours to make them more systematic and reproducible.
"},{"location":"community/training/","title":"Training events","text":"We will be running an Introduction to Debugging workshop at the SPECTRUM Annual Meeting 2024 (23-25 September).
"},{"location":"community/training/debugging/","title":"Introduction to Debugging","text":"This workshop was prepared for the SPECTRUM Annual Meeting 2024 (23-25 September).
Tip
We all make mistakes when writing code and introduce errors.
Having good debugging skills means that you can spend less time fixing your code.
See the discussion in our August 2024 meeting for further background.
"},{"location":"community/training/debugging/building-your-skills/","title":"Building your skills","text":"Tip
Whenever you debug some code, consider it as an opportunity to learn, reflect, and build your debugging skills.
Pay attention to your experience \u2014 what worked well, and what would you do differently next time?
"},{"location":"community/training/debugging/building-your-skills/#identifying-errors","title":"Identifying errors","text":"Write a failing test case, this allows you to verify that the bug can be reproduced.
"},{"location":"community/training/debugging/building-your-skills/#developing-a-plan","title":"Developing a plan","text":"What information might help you decide how to begin?
Can you identify a recent \"known good\" version of the code that doesn't include the error?
If you're using version control, have a look at your recent commits and check whether any of them are likely to have introduced or exposed this error.
"},{"location":"community/training/debugging/building-your-skills/#searching-for-the-root-cause","title":"Searching for the root cause","text":"We've shown how a debugger allows you to pause your code and see what it's actually doing. This is extremely helpful!
Tip
Other approaches may be useful, but avoid using trial-and-error.
To quickly confirm or rule out specific suspicions, you might consider using:
print()
statements;assert()
to verify whether specific conditions are met;git bisect
to identify the commit that introduced the error.Is there an optimal solution?
This might be the solution that changes as little code as possible, or it might be a solution that involves modifying and/or restructuring other parts of your code.
"},{"location":"community/training/debugging/building-your-skills/#after-its-fixed","title":"After it's fixed","text":"If you didn't write a test case to identify the error (see above), now is the time to write a test case to ensure you don't even make the same error again.
Are there other parts of your code where you might make a similar mistake, for which you could also write test cases?
Are there coding practices that might make this kind of error easier to find next time? For example, this might involve dividing your code into smaller functions, or using version control to record commits early and often.
Have you considered defensive programming practices? For example, at the start of a function it can often be a good idea to check that all of the arguments have valid values.
Are there tools or approaches that you haven't used before, and which might be worth trying next time?
"},{"location":"community/training/debugging/example-square-numbers/","title":"Example: Square numbers","text":"Square numbers are positive integers that are equal to the square of an integer. Here we have provided example Python and R scripts that print all of the square numbers between 1 and 100:
You can download these scripts to run on your own computer:
Each script contains three functions:
main()
find_squares(lower_bound, upper_bound)
is_square(value)
The diagram below shows how main()
calls find_squares()
, which in turn calls is_square()
many times.
sequenceDiagram\n participant M as main()\n participant F as find_squares()\n participant I as is_square()\n activate M\n M ->>+ F: lower_bound = 1, upper_bound = 100\n Note over F: squares = [ ]\n F ->>+ I: value = 1\n I ->>- F: True/False\n F ->>+ I: value = 2\n I ->>- F: True/False\n F -->>+ I: ...\n I -->>- F: ...\n F ->>+ I: value = 100\n I ->>- F: True/False\n F ->>- M: squares = [...]\n Note over M: print(squares)\n deactivate M
Source code PythonR square_numbers.py#!/usr/bin/env python3\n\"\"\"\nPrint the square numbers between 1 and 100.\n\"\"\"\n\n\ndef main():\n squares = find_squares(1, 100)\n print(squares)\n\n\ndef find_squares(lower_bound, upper_bound):\n squares = []\n value = lower_bound\n while value <= upper_bound:\n if is_square(value):\n squares.append(value)\n value += 1\n return squares\n\n\ndef is_square(value):\n for i in range(1, value + 1):\n if i * i == value:\n return True\n return False\n\n\nif __name__ == '__main__':\n main()\n
square_numbers.R#!/usr/bin/env -S Rscript --vanilla\n#\n# Print the square numbers between 1 and 100.\n#\n\n\nmain <- function() {\n squares <- find_squares(1, 100)\n print(squares)\n}\n\n\nfind_squares <- function(lower_bound, upper_bound) {\n squares <- c()\n value <- lower_bound\n while (value <= upper_bound) {\n if (is_square(value)) {\n squares <- c(squares, value)\n }\n value <- value + 1\n }\n squares\n}\n\n\nis_square <- function(value) {\n for (i in seq(value)) {\n if (i * i == value) {\n return(TRUE)\n }\n }\n FALSE\n}\n\nif (! interactive()) {\n main()\n}\n
"},{"location":"community/training/debugging/example-square-numbers/#stepping-through-the-code","title":"Stepping through the code","text":"These recorded terminal sessions demonstrate how to use Python and R debuggers from the command line. They cover:
Manual breakpoints
You can also create breakpoints in your code by calling breakpoint()
for Python, and browser()
for R.
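For example, one way to pause inside find_squares() from the square_numbers.py script above is to add a breakpoint() call; this is a sketch of the Python version only.

```python
def is_square(value):
    for i in range(1, value + 1):
        if i * i == value:
            return True
    return False


def find_squares(lower_bound, upper_bound):
    squares = []
    value = lower_bound
    while value <= upper_bound:
        breakpoint()  # Execution pauses here and drops into pdb ("c" continues).
        if is_square(value):
            squares.append(value)
        value += 1
    return squares


if __name__ == "__main__":
    print(find_squares(1, 100))
```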
Interactive debugger sessions
If your editor supports running a debugger, use this feature! See these examples for RStudio, PyCharm, Spyder, and VS Code.
Python debuggerR debuggerVideo timeline:
is_square()
is_square()
squares
listVideo timeline:
is_square()
is_square()
squares
listPerfect numbers are positive integers that are equal to the sum of their divisors. Here we have provided example Python and R scripts that should print all of the perfect numbers up to 1,000.
You can download each script to debug on your own computer:
#!/usr/bin/env python3\n\"\"\"\nThis script prints perfect numbers.\n\nPerfect numbers are positive integers that are equal to the sum of their\ndivisors.\n\"\"\"\n\n\ndef main():\n start = 2\n end = 1_000\n for value in range(start, end + 1):\n if is_perfect(value):\n print(value)\n\n\ndef is_perfect(value):\n divisors = divisors_of(value)\n return sum(divisors) == value\n\n\ndef divisors_of(value):\n divisors = []\n candidate = 2\n while candidate < value:\n if value % candidate == 0:\n divisors.append(candidate)\n candidate += 1\n return divisors\n\n\nif __name__ == '__main__':\n main()\n
perfect_numbers.R#!/usr/bin/env -S Rscript --vanilla\n#\n# This script prints perfect numbers.\n#\n# Perfect numbers are positive integers that are equal to the sum of their\n# divisors.\n#\n\n\nmain <- function() {\n start <- 2\n end <- 1000\n for (value in seq.int(start, end)) {\n if (is_perfect(value)) {\n print(value)\n }\n }\n}\n\n\nis_perfect <- function(value) {\n divisors <- divisors_of(value)\n sum(divisors) == value\n}\n\n\ndivisors_of <- function(value) {\n divisors <- c()\n candidate <- 2\n while (candidate < value) {\n if (value %% candidate == 0) {\n divisors <- c(divisors, candidate)\n }\n candidate <- candidate + 1\n }\n divisors\n}\n\n\nmain()\n
But there's a problem ...
If we run these scripts, we see that they don't print anything:
How should we begin investigating?
Interactive debugger sessions
If your editor supports running a debugger, use this feature! See these examples for RStudio, PyCharm, Spyder, and VS Code.
Some initial thoughts ...Are we actually running the main()
function at all?
The main()
function is almost certainly not the cause of this error.
The is_perfect()
function is very simple, so it's unlikely to be the cause of this error.
The divisors_of()
function doesn't look obviously wrong.
But there must be a mistake somewhere!
Let's use a debugger to investigate.
Here we have provided SIR ODE model implementations in Python and in R. Each script runs several scenarios and produces a plot of infection prevalence for each scenario.
You can download each script to debug on your computer:
#!/usr/bin/env python3\n\nimport matplotlib.pyplot as plt\nimport numpy as np\nfrom scipy.integrate import solve_ivp\n\n\ndef sir_rhs(time, state, popn, beta, gamma):\n \"\"\"\n The right-hand side for the vanilla SIR compartmental model.\n \"\"\"\n s_to_i = beta * state[1] * state[0] / popn # beta * I(t) * S(t) / N\n i_to_r = gamma * state[1] # gamma * I(t)\n return [-s_to_i, s_to_i - i_to_r, i_to_r]\n\n\ndef run_model(settings):\n \"\"\"\n Return the SIR ODE solution for the given model settings.\n \"\"\"\n # Define the time span and evaluation times.\n sim_days = settings['sim_days']\n time_span = [0, sim_days]\n times = np.linspace(0, sim_days, num=sim_days + 1)\n # Define the initial state.\n popn = settings['population']\n exposures = settings['exposures']\n initial_state = [popn - exposures, exposures, 0]\n # Define the SIR parameters.\n params = (popn, settings['beta'], settings['gamma'])\n # Return the daily number of people in S, I, and R.\n return solve_ivp(\n sir_rhs, time_span, initial_state, t_eval=times, args=params\n )\n\n\ndef default_settings():\n \"\"\"\n The default model settings.\n \"\"\"\n return {\n 'sim_days': 20,\n 'population': 100,\n 'exposures': 2,\n 'beta': 1.0,\n 'gamma': 0.5,\n }\n\n\ndef run_model_scaled_beta(settings, scale):\n \"\"\"\n Adjust the value of ``beta`` before running the model.\n \"\"\"\n settings['beta'] = scale * settings['beta']\n return run_model(settings)\n\n\ndef run_model_scaled_gamma(settings, scale):\n \"\"\"\n Adjust the value of ``gamma`` before running the model.\n \"\"\"\n settings['gamma'] = scale * settings['gamma']\n return run_model(settings)\n\n\ndef plot_prevalence_time_series(solutions):\n \"\"\"\n Plot daily prevalence of infectious individuals for one or more scenarios.\n \"\"\"\n fig, axes = plt.subplots(\n constrained_layout=True,\n nrows=len(solutions),\n sharex=True,\n sharey=True,\n )\n for ix, (scenario_name, solution) in enumerate(solutions.items()):\n ax = axes[ix]\n ax.title.set_text(scenario_name)\n ax.plot(solution.y[1], label='I')\n ax.set_xticks([0, 5, 10, 15, 20])\n # Save the figure.\n png_file = 'sir_ode_python.png'\n fig.savefig(png_file, format='png', metadata={'Software': None})\n\n\ndef demonstration():\n settings = default_settings()\n default_scenario = run_model(settings)\n scaled_beta_scenario = run_model_scaled_beta(settings, scale=1.5)\n scaled_gamma_scenario = run_model_scaled_gamma(settings, scale=0.7)\n\n plot_prevalence_time_series(\n {\n 'Default': default_scenario,\n 'Scaled Beta': scaled_beta_scenario,\n 'Scaled Gamma': scaled_gamma_scenario,\n }\n )\n\n\nif __name__ == '__main__':\n demonstration()\n
sir_ode.R#!/usr/bin/env -S Rscript --vanilla\n\nlibrary(deSolve)\nsuppressPackageStartupMessages(library(dplyr))\nsuppressPackageStartupMessages(library(ggplot2))\n\n\n# The right-hand side for the vanilla SIR compartmental model.\nsir_rhs <- function(time, state, params) {\n s_to_i <- params$beta * state[\"I\"] * state[\"S\"] / params$popn\n i_to_r <- params$gamma * state[\"I\"]\n list(c(-s_to_i, s_to_i - i_to_r, i_to_r))\n}\n\n\n# Return the SIR ODE solution for the given model settings.\nrun_model <- function(settings) {\n # Define the time span and evaluation times.\n times <- seq(0, settings$sim_days)\n # Define the initial state.\n popn <- settings$population\n exposures <- settings$exposures\n initial_state <- c(S = popn - exposures, I = exposures, R = 0)\n # Define the SIR parameters.\n params <- list(\n popn = settings$population,\n beta = settings$beta,\n gamma = settings$gamma\n )\n # Return the daily number of people in S, I, and R.\n ode(initial_state, times, sir_rhs, params)\n}\n\n\n# The default model settings.\ndefault_settings <- function() {\n list(\n sim_days = 20,\n population = 100,\n exposures = 2,\n beta = 1.0,\n gamma = 0.5\n )\n}\n\n\n# Adjust the value of ``beta`` before running the model.\nrun_model_scaled_beta <- function(settings, scale) {\n settings$beta <- scale * settings$beta\n run_model(settings)\n}\n\n\n# Adjust the value of ``gamma`` before running the model.\nrun_model_scaled_gamma <- function(settings, scale) {\n settings$gamma <- scale * settings$gamma\n run_model(settings)\n}\n\n\n# Plot daily prevalence of infectious individuals for one or more scenarios.\nplot_prevalence_time_series <- function(solutions) {\n df <- lapply(\n names(solutions),\n function(name) {\n solutions[[name]] |>\n as.data.frame() |>\n mutate(scenario = name)\n }\n ) |>\n bind_rows() |>\n mutate(\n scenario = factor(scenario, levels = names(solutions), ordered = TRUE)\n )\n fig <- ggplot() +\n geom_line(aes(time, I), df) +\n xlab(NULL) +\n scale_y_continuous(\n name = NULL,\n limits = c(0, 40),\n breaks = c(0, 20, 40)\n ) +\n facet_wrap(~ scenario, ncol = 1) +\n theme_bw(base_size = 10) +\n theme(\n strip.background = element_blank(),\n panel.grid = element_blank(),\n )\n png_file <- \"sir_ode_r.png\"\n ggsave(png_file, fig, width = 640, height = 480, units = \"px\", dpi = 150)\n}\n\n\ndemonstration <- function() {\n settings <- default_settings()\n default_scenario <- run_model(settings)\n scaled_beta_scenario <- run_model_scaled_beta(settings, scale=1.5)\n scaled_gamma_scenario <- run_model_scaled_gamma(settings, scale=0.7)\n\n plot_prevalence_time_series(\n list(\n Default = default_scenario,\n `Scaled Beta` = scaled_beta_scenario,\n `Scaled Gamma` = scaled_gamma_scenario\n )\n )\n}\n\ndemonstration()\n
The model outputs differ!
Here are prevalence time-series plots produced by each script:
Python plotR plotModel outputs for the Python script.
Model outputs for the R script.
Interactive debugger sessions
If your editor supports running a debugger, use this feature! See these examples for RStudio, PyCharm, Spyder, and VS Code.
Some initial thoughts ...Is it obvious whether one of the figures is correct and the other is wrong?
The sir_rhs()
functions in the two scripts appear to be equivalent \u2014 but are they?
The default_settings()
functions appear to be equivalent \u2014 but are they?
The run_model_scaled_beta()
and run_model_scaled_gamma()
functions also appear to be equivalent.
Where might you begin looking?
In this workshop, we will introduce the concept of \"debugging\", and demonstrate techniques and tools that can help us efficiently identify and remove errors from our code.
After completing this workshop, participants will:
Understand that debugging can be divided into a sequence of actions;
Understand the purpose of each of these actions;
Be familiar with techniques and tools that can help perform these actions;
Be able to apply these techniques and tools to their own code.
Info
By achieving these learning objectives, participants should be able to find and correct errors in their code more quickly and with greater confidence.
"},{"location":"community/training/debugging/manifesto/","title":"Debugging manifesto","text":"Julia Evans and Tanya Brassie: Debugging Manifesto Poster, 2024.Info
See the Resources page for links to more of Julia Evans' articles, stories, and zines about debugging.
"},{"location":"community/training/debugging/resources/","title":"Resources","text":"Info
Please don't look at these solutions until you have attempted the exercises.
Perfect numbersPerfect numbers are equal to the sum of their proper divisors \u2014 all divisors except the number itself.
For example, 6 is a perfect number. Its proper divisors are 1, 2, and 3, and 1 + 2 + 3 = 6.
The mistake here is that the divisors_of()
function only returns divisors greater than one, and so the code fails to identify any of the true perfect numbers.
Interestingly, this mistake did not result in the code mistakenly identifying any other numbers as perfect numbers.
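One possible fix (sketched here for the Python version) is to include 1 as a proper divisor by starting the candidate divisor at 1 instead of 2:

```python
def divisors_of(value):
    divisors = []
    candidate = 1  # Start at 1 so that 1 is counted as a proper divisor.
    while candidate < value:
        if value % candidate == 0:
            divisors.append(candidate)
        candidate += 1
    return divisors
```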
Python vs RIf you're only familiar with one of these two languages, you may be surprised to discover that they have some fundamental differences. In this exercise we demonstrated one consequence of the ways that these languages handle function arguments.
The R Language Definition
The semantics of invoking a function in R argument are call-by-value. In general, supplied arguments behave as if they are local variables initialized with the value supplied and the name of the corresponding formal argument. Changing the value of a supplied argument within a function will not affect the value of the variable in the calling frame.
\u2014 Argument Evaluation
Python Programming FAQ
Remember that arguments are passed by assignment in Python. Since assignment just creates references to objects, there's no alias between an argument name in the caller and callee, and so no call-by-reference per se.
\u2014 How do I write a function with output parameters (call by reference)?
In R the run_model_scaled_beta()
and run_model_scaled_gamma()
functions do not modify the value of settings
in the demonstration()
function. This produces model outputs for the following parameter combinations:
In Python the run_model_scaled_beta()
and run_model_scaled_gamma()
functions do modify the value of settings
in the demonstration()
function. This produces model outputs for the following parameter combinations:
Answer
The value of \u03b2 is different in the third combination.
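A stripped-down sketch of the underlying difference is shown below: in Python, the settings dictionary is passed by assignment (a reference), so the in-place change made inside the function is visible to the caller.

```python
def run_model_scaled_beta(settings, scale):
    # Modifies the caller's dictionary in place.
    settings["beta"] = scale * settings["beta"]
    return settings["beta"]


settings = {"beta": 1.0, "gamma": 0.5}
run_model_scaled_beta(settings, scale=1.5)
print(settings["beta"])  # Prints 1.5: the caller's settings were modified.
```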
"},{"location":"community/training/debugging/understanding-error-messages/","title":"Understanding error messages","text":"Tip
The visible error and its root cause may be located in very different parts of your code.
If there's an error in your code that causes the program to terminate, read the error message and see what it can tell you.
Most of the time, the error message should allow you to identify:
What went wrong? For example, did it try to read data from a file that does not exist?
Where did this happen? On which line of which file did the error occur?
When an error occurs, one useful piece of information is knowing which functions were called in order to make the error occur.
Below we have example Python and R scripts that produce an error.
Question
Can you identify where the error occurred, just by looking at the error message?
OverviewPythonRYou can download each script and run them on your own computer:
Traceback (most recent call last):\n File \"stacktrace.py\", line 46, in <module>\n status = main()\n File \"stacktrace.py\", line 7, in main\n do_big_tasks()\n File \"stacktrace.py\", line 17, in do_big_tasks\n do_third_step(i, quiet=quiet)\n File \"stacktrace.py\", line 38, in do_third_step\n try_something()\n File \"stacktrace.py\", line 42, in try_something\n raise ValueError(\"Whoops, this failed\")\nValueError: Whoops, this failed\n
Source code stacktrace.py#!/usr/bin/env python3\n\nimport sys\n\n\ndef main():\n do_big_tasks()\n return 0\n\n\ndef do_big_tasks(num_tasks=20, quiet=True):\n for i in range(num_tasks):\n prepare_things(i, quiet=quiet)\n do_first_step(i, quiet=quiet)\n do_second_step(i, quiet=quiet)\n if i > 15:\n do_third_step(i, quiet=quiet)\n\n\ndef prepare_things(task_num, quiet=True):\n if not quiet:\n print(f'Preparing for task #{task_num}')\n\n\ndef do_first_step(task_num, quiet=True):\n if not quiet:\n print(f'Task #{task_num}: doing step #1')\n\n\ndef do_second_step(task_num, quiet=True):\n if not quiet:\n print(f'Task #{task_num}: doing step #2')\n\n\ndef do_third_step(task_num, quiet=True):\n if not quiet:\n print(f'Task #{task_num}: doing step #3')\n try_something()\n\n\ndef try_something():\n raise ValueError(\"Whoops, this failed\")\n\n\nif __name__ == \"__main__\":\n status = main()\n sys.exit(status)\n
"},{"location":"community/training/debugging/understanding-error-messages/#the-error-message_1","title":"The error message","text":"Error in try_something() : Whoops, this failed\nCalls: main -> do_big_tasks -> do_third_step -> try_something\nBacktrace:\n \u2586\n 1. \u2514\u2500global main()\n 2. \u2514\u2500global do_big_tasks()\n 3. \u2514\u2500global do_third_step(i, quiet = quiet)\n 4. \u2514\u2500global try_something()\nExecution halted\n
Source code stacktrace.R#!/usr/bin/env -S Rscript --vanilla\n\noptions(error = rlang::entrace)\n\n\nmain <- function() {\n do_big_tasks()\n invisible(0)\n}\n\ndo_big_tasks <- function(num_tasks = 20, quiet = TRUE) {\n for (i in seq_len(num_tasks)) {\n prepare_things(i, quiet = quiet)\n do_first_step(i, quiet = quiet)\n do_second_step(i, quiet = quiet)\n if (i > 15) {\n do_third_step(i, quiet = quiet)\n }\n }\n}\n\nprepare_things <- function(task_num, quiet = TRUE) {\n if (!quiet) {\n cat(\"Preparing for task #\", task_num, \"\\n\", sep = \"\")\n }\n}\n\ndo_first_step <- function(task_num, quiet = TRUE) {\n if (!quiet) {\n cat(\"Task #\", task_num, \": doing step #1\\n\", sep = \"\")\n }\n}\n\ndo_second_step <- function(task_num, quiet = TRUE) {\n if (!quiet) {\n cat(\"Task #\", task_num, \": doing step #2\\n\", sep = \"\")\n }\n}\n\ndo_third_step <- function(task_num, quiet = TRUE) {\n if (!quiet) {\n cat(\"Task #\", task_num, \": doing step #3\\n\", sep = \"\")\n }\n try_something()\n}\n\ntry_something <- function() {\n stop(\"Whoops, this failed\")\n}\n\nif (! interactive()) {\n status <- main()\n quit(status = status)\n}\n
"},{"location":"community/training/debugging/using-a-debugger/","title":"Using a debugger","text":"The main features of a debugger are:
Breakpoints: pause the program when a particular line of code is about to be executed;
Display/print: show the current value of local variables;
Next: execute the current line of code and pause at the next line;
Continue: continue executing code until the next breakpoint, or the code finishes.
Slightly more advanced features include:
Conditional breakpoints: pause the program when a particular line of code is about to be executed and a specific condition is satisfied.
Step: execute the current line of code and pause at the first possible point \u2014 either the next line in the current function or the first line of a function that is called.
For example, consider the following code example:
def first_function():\n    total = 0\n    for x in range(1, 50):\n        y = second_function(x)\n        total = total + y\n\n    return total\n\n\ndef second_function(a):\n    result = 3 * a**2 + 5 * a\n    return result\n\n\nfirst_function()\n
first_function <- function() {\n total <- 0\n for (x in seq(49)) {\n y <- second_function(x)\n total <- total + y\n }\n total\n}\n\nsecond_function <- function(a) {\n result <- 3 * a^2 + 5 * a\n result\n}\n\nfirst_function()\n
We can use a conditional breakpoint to pause on line 4 (highlighted) only when x = 42.
We can then use step to begin executing line 4 and pause on line 11, where we will see that a = 42.
If we instead used next at line 4 (highlighted), the debugger would execute line 4 and then pause on line 5.
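If your editor or debugger does not provide conditional breakpoints directly, one workaround is to start the debugger from within your code only when the condition of interest holds. The sketch below adapts the Python example above and uses pdb, Python's standard debugger; pausing inside the if statement has the same effect as a conditional breakpoint on line 4, and you can then use commands such as step and continue at the pdb prompt.
import pdb\n\n\ndef first_function():\n    total = 0\n    for x in range(1, 50):\n        if x == 42:\n            # Pause here only when x is 42, mimicking a conditional breakpoint.\n            pdb.set_trace()\n        y = second_function(x)\n        total = total + y\n\n    return total\n\n\ndef second_function(a):\n    result = 3 * a**2 + 5 * a\n    return result\n\n\nfirst_function()\n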
Info
Debugging is the process of identifying and removing errors from computer software.
You need to identify (and reproduce) the problem before you begin fixing it. Ideally, write a test case first, so that you can (a) confirm that you have identified the problem; and (b) detect if you accidentally introduce the same, or a similar, mistake in the future.
"},{"location":"community/training/debugging/what-is-debugging/#action-1-identify-the-error","title":"Action 1: Identify the error","text":"Tip
First make sure that you can reproduce the error.
What observations or outputs indicate the presence of this error?
Is the error reproducible, or does it come and go?
Could you write a failing test that captures the error? (See the sketch below.)
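One concrete way to capture the error is to write a small failing test before you change anything. The sketch below is a hypothetical example for a test runner such as pytest: clean_data() stands in for whatever piece of your code contains the error, and the test is expected to fail until the error is fixed.
def clean_data(values):\n    # Hypothetical stand-in for the buggy code: it should drop negative\n    # values, but currently returns the list unchanged.\n    return values\n\n\ndef test_clean_data_drops_negative_values():\n    # This test fails until the bug above is fixed, and then protects\n    # against the same mistake being reintroduced later.\n    assert clean_data([1.0, -2.0, 3.0]) == [1.0, 3.0]\n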
"},{"location":"community/training/debugging/what-is-debugging/#action-2-develop-a-plan","title":"Action 2: Develop a plan","text":"Tip
The visible error and its root cause may be located in very different parts of your code.
Identify likely and unlikely suspects: what can we rule in or out? Which parts of your code changed recently? When was the last time that you would have noticed this error, had it been present?
"},{"location":"community/training/debugging/what-is-debugging/#action-3-search-for-the-root-cause","title":"Action 3: Search for the root cause","text":"Tip
As much as possible, the search should be guided by facts, not assumptions.
Our assumptions about the code can help us to develop a plan, but we need to verify whether our assumptions are actually true.
For example:
Simple errors can often hide\nhide in plain sight and be\nsurprisingly difficult to\ndiscover without assistance.\n
Thinking \"this looks right\" is not a reliable indicator of whether a piece of code contains an error.
Searching at random is like looking for a needle in a haystack. (Image: Perry McKenna, Flickr, 2009; CC BY 2.0) Better approaches involve confirming what the code is actually doing.
This can be done using indirect approaches, such as adding print statements or writing test cases.
It can also be done by directly inspecting the state of the program with a debugger \u2014 more on that shortly!
Tip
It's worth considering if the root cause is a result of deliberate decisions or unintentional mistakes.
Don't start modifying, adding, or removing lines based on suspicions, or on the off chance that a change might work. Without identifying the root cause of the error, there is no guarantee that making the error seem to disappear has actually fixed the underlying problem.
"},{"location":"community/training/debugging/what-is-debugging/#action-5-after-its-fixed","title":"Action 5: After it's fixed","text":"Tip
This is the perfect time to reflect on your experience!
What can you learn from this experience? Can you avoid this mistake in the future? What parts of the process were the hardest or took the longest? Are there tools or techniques that might help you next time?
"},{"location":"community/training/debugging/why-are-debuggers-useful/","title":"Why are debuggers useful?","text":"Tip
A debugger is a tool for examining the state of a running program.
Debuggers are useful because they show us what the code is actually doing.
Many of the errors that take a long time for us to find are relatively simple once we find them.
We usually have a hard time finding these errors because:
We read what we expect to see, rather than what is actually written; and
We rely on assumptions about where the mistake might be, and our intuition is often wrong.
Here are some common mistakes that can be difficult to identify when reading through your own code:
Using an incorrect index into an array, matrix, list, etc;
Using incorrect bounds on a loop or sequence;
Confusing the digit \"1\" with the letter \"l\"; and
Confusing the digit \"0\" with the letter \"O\".
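As a small illustration of how the mistakes listed above can hide in plain sight, the following Python sketch computes a mean using incorrect loop bounds and an incorrect index; both bugs are easy to read past.
values = [2, 4, 6, 8]\n\n# Intended: sum every value. Bug: range(1, len(values)) skips the first element.\ntotal = 0\nfor i in range(1, len(values)):\n    total = total + values[i]\n\n# Intended: the last value. Bug: values[len(values)] raises an IndexError,\n# because the last valid index is len(values) - 1.\n# last = values[len(values)]\n\nmean = total / len(values)\nprint(mean)  # Prints 4.5 rather than the correct mean of 5.0.\n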
These materials are divided into the following sections:
Understanding version control, which provides you with a complete and annotated history of your work, and with powerful ways to search and examine this history.
Learning to use Git, the most widely used version control system, which is the foundation of popular code-hosting services such as GitHub, GitLab, and Bitbucket.
Using Git to collaborate with colleagues in a precisely controlled and manageable way.
Learning how to structure your project so that it is easier for yourself and others to navigate.
Learning how to write code so that it clearly expresses your intent and ideas.
Ensuring that your research is reproducible by others.
Using testing frameworks to verify that your code behaves as intended, and to automatically detect when you introduce a bug or mistake into your code.
Running your code on various computing platforms that allow you to obtain results efficiently and without relying on your own laptop/computer.
This page defines the learning objectives for individual sections. These are skills that the reader should be able to demonstrate after reading through the relevant section, and completing any exercises in that section.
"},{"location":"guides/learning-objectives/#version-control-concepts","title":"Version control concepts","text":"After completing this section, you should be able to identify how to apply version control concepts to your existing work. This includes being able to:
Identify projects and tasks for which version control would be suitable;
Categorise recent work activities into one or more commits;
Write commit messages that describe what changes you made and why you made them; and
Identify pieces of work that could be carried out in separate branches of a repository.
After completing this section, you should be able to:
Create a local repository;
Create commits in your local repository;
Search your commit history to identify commits that made a specific change;
Create a remote repository;
Push commits from your local repository to a remote repository;
Pull commits from a remote repository to your local repository;
Use tags to identify important milestones;
Work in a separate branch and then merge your changes into your main branch; and
Resolve merge conflicts.
After completing this section, you should be able to:
Share a repository with one or more collaborators;
Create a pull request;
Use a pull request to review a collaborator's work;
Use a pull request to merge a collaborator's work into your main branch; and
Conduct peer code review in a respectful manner.
After completing this section, you should be able to:
Understand how to structure a new project;
Understand how to separate \"what to do\" from \"how to do it\"; and
Structure your code to enable new experiments and analyses.
After completing this section, you should be able to:
Divide your code into functions and modules;
Ensure that your code is a clear expression of your ideas;
Structure your code into reusable packages; and
Take advantage of code formatters and code linters.
These materials assume that the reader has a basic knowledge of the Bash command-line shell and of using SSH to connect to remote computers. You should be comfortable with using the command-line to perform the following tasks:
Please refer to the following materials for further details:
Info
If you use Windows, you may want to use PowerShell instead of Bash, in which case please refer to this Introduction to the Windows Command Line with Powershell.
Some chapters also assume that the reader has an account on GitHub and has added an SSH key to their account.
"},{"location":"guides/resources/","title":"Useful resources","text":""},{"location":"guides/resources/#education-and-commentary-articles","title":"Education and commentary articles","text":"A Beginner's Guide to Conducting Reproducible Research describes key requirements for producing reproducible research outputs.
Why code rusts collects together some of the reasons why the behaviour of code changes over time.
Point of View: How open science helps researchers succeed presents evidence that open research practices bring significant benefits to researchers.
The Journal of Statistics and Data Science Education published a special issue: Teaching Reproducibility in November 2022. Also see the presentations from an invited paper session:
Collaborative writing workflows: building blocks towards reproducibility
Opinionated practices for teaching reproducibility: motivation, guided instruction, and practice
From teaching to practice: Insights from the Toronto Reproducibility Conferences
Teaching reproducibility and responsible workflow: an editor's perspective
A Quick Guide to Organizing Computational Biology Projects suggests an approach for structuring a computational research repository.
The TIER Protocol 4.0 provides a template for organising the contents and reproduction documentation for projects that involve working with statistical data:
Documentation that meets the specifications of the TIER Protocol contains all the data, scripts, and supporting information necessary to enable you, your instructor, or an interested third party to reproduce all the computations necessary to generate the results you present in the report you write about your project.
A simple kit to use computational notebooks for more openness, reproducibility, and productivity in research provides some good recommendations for organising a project repository and setting up a reproducible workflow using computational notebooks.
NDP Software have created an interactive Git cheat-sheet that shows how git commands interact with the local and upstream repositories, and provides brief documentation for many common examples.
The Pro Git book is available online. It starts with an overview of Git basics and then covers every aspect of Git in great detail.
The Software Carpentry Foundation publishes many lessons, including Version Control with Git.
A Quick Introduction to Version Control with Git and GitHub provides a short guide to using Git and GitHub. It presents an example of analysing publicly available ChIP-seq data with Python. The repository for the article is also publicly available.
CoMo Consortium App: the COVID-19 International Modelling Consortium (CoMo) has developed a Shiny web application for an age-structured, compartmental SEIRS model.
Mastering Shiny: an online book that teaches how to create web applications with R and Shiny.
The Art of Giving and Receiving Code Reviews (Gracefully)
Code Review in the Lab
Scientific Code Review
The 5 Golden Rules of Code Review
GitHub Actions for Python: the GitHub Actions documentation includes examples of building and testing Python projects.
Building reproducible analytical pipelines with R: this article shows how to use GitHub Actions to run R code when you push new commits to a GitHub repository.
GitHub Actions for the R language: this repository provides a variety of GitHub actions for R projects, such as installing specific versions of R and R packages.
See the GitHub actions for Git is my lab book. The build action does the following:
Check out the repository with actions/checkout;
Install Python with actions/setup-python;
Install Material for MkDocs and other required tools, as listed in requirements.txt; and
Build the HTML version of this book with mkdocs.
How to access the ARDC Nectar Research Cloud
Melbourne Research Cloud
High Performance Computing at University of Melbourne
The ARDC Guide to making software citable explains how to cite your code and assign it a DOI.
Recognizing the value of software: a software citation guide provides further examples and guidance for ensuring your work receives proper attribution and credit.
Choose an open source license provides advice for selecting an appropriate license that meets your needs.
A Quick Guide to Software Licensing for the Scientist-Programmer explains the various types of available licenses and provides advice for selecting a suitable license.
This section demonstrates how to use Git for collaborative research, enabling multiple people to work on the same code or paper in parallel. This includes deciding how to structure your repository, how to use branches for each collaborator, and how to use tags to track your progress.
Info
We also show how these skills support peer code review, so that you can share knowledge with, and learn from, your colleagues as part of your regular activity.
"},{"location":"guides/collaborating/an-example-pull-request/","title":"An example pull request","text":"The initial draft of each chapter in this section were proposed in a pull request.
When this pull request was created, the branch added four new commits:
85594bf Add some guidelines for collaboration workflows\n678499b Discuss coding style guides\n2b9ff70 Discuss merge/pull requests and peer code review\n6cc6f54 Discuss repository structure and licenses\n
and the author (Rob Moss) asked the reviewer (Eamon Conway) to address several details in particular.
Eamon made several suggestions in their initial response, including:
Moving the How to structure a repository and Choosing a license chapters to the Effective use of git section;
Starting this section with the Collaborating on code chapter; and
Agreeing that we should use this pull request as an example in this book.
In response, Rob pushed two commits that addressed the first two points above:
e1d1dd9 Move collaboration guidelines to the start\n3f78ef8 Move the repository structure and license chapters\n
and then wrote this chapter to show how we used a pull request to draft this book section.
"},{"location":"guides/collaborating/coding-style-guides/","title":"Coding style guides","text":"A style guide defines rules and guidelines for how to structure and format your code. This can make code easier to write, because you don't need to worry about how to format your code. It can also make code easier to read, because consistent styling allows you to focus on the content.
There are two types of tools that can help you use a style guide:
A formatter formats your code to make it consistent with a specific style; and
A linter checks whether your code is consistent with a specific style.
Because programming languages can be very different from each other, style guides are usually defined for a single programming language.
Here we list some of the most widely-used style guides for several common programming languages:
You can check that your code conforms to this style with lintr.
For Python there is Black, which defines a coding style and applies this style to your code.
For C++ there is a Google C++ style guide.
Once you are comfortable with creating commits, working in branches, and merging branches, you can use these skills to write papers collaboratively as a team. This approach is particularly useful if you are writing a paper in LaTeX.
Here are some general guidelines that you may find useful:
Divide the paper into separate LaTeX files for each section.
Use tags to identify milestones such as draft versions and revisions.
Consider creating a separate branch for each collaborator.
Use latexdiff to show tracked changes between the current version and a previous commit/tag:
latexdiff-git --flatten -r tag-name paper.tex\n
Collaborators who will provide feedback, rather than contributing directly to the writing process, can do this by:
Annotating PDF versions of the paper; or
Once you are comfortable with creating commits, working in branches, and merging branches, you can use these skills to write code collaboratively as a team.
The precise workflow will depend on the nature of your research and on the collaborators in your team, but there are some general guidelines that you may find helpful:
Agree on a style guide.
Work on separate features in separate branches.
Use peer code review before merging changes from these branches.
Consider using continuous integration to:
Run test cases and detect bugs as early as possible; and
Continuous Integration (CI) is an automated process where code changes are merged in a central repository in order to run automated tests and other processes. This can provide rapid feedback while you develop your code and collaborate with others, as long as commits are regularly pushed to the central repository.
Info
This book is an example of Continuous Integration: every time a commit is pushed to the central repository, the online book is automatically updated.
Because the central repository is hosted on GitHub, we use GitHub Actions. Note that this is a GitHub-specific CI system. You can view the update action for this book here.
We also use CI to publish each pull request, so that contributions can be previewed during the review process. We added this feature in this pull request.
"},{"location":"guides/collaborating/merge-pull-requests/","title":"Merge/Pull requests","text":"Recall that incorporating the changes from one branch into another branch is referred to as a \"merge\". You can merge one branch into another branch by taking the following steps:
Checking out the branch you want to merge the changes into:
git checkout -b my-branch\n
Merging the changes from the other branch:
git merge other-branch\n
Tip
It's a good idea to review these changes before you merge them.
If possible, it's even better to have someone else review the changes.
You can use git diff
to view differences between branches. However, platforms such as GitHub and GitLab offer an easier approach: \"pull requests\" (also called \"merge requests\").
The steps required to create a pull request differ depending on which platform you are using. Here, we will describe how to create a pull request on GitHub. For further details, see the GitHub documentation.
Open the main page of your GitHub repository.
In the \"Branch\" menu, select the branch that contains the changes you want to merge.
Open the \"Contribute\" menu. This should be located on the right-hand side, above the list of files.
Click the \"Open pull request\" button.
In the \"base\" menu, select the branch you want to merge the changes into.
Enter a descriptive title for the pull request.
In the message editor, write a summary of the changes in this branch, and identify specific questions or objectives that you want the reviewer to address.
Select potential reviewers by clicking on the \"Reviewers\" link in the right-hand sidebar.
Click the \"Create pull request\" button.
Once the pull request has been created, the reviewer(s) can review your changes and discuss their feedback and suggestions with you.
"},{"location":"guides/collaborating/merge-pull-requests/#merging-a-pull-request-on-github","title":"Merging a pull request on GitHub","text":"When the pull request has been reviewed to your satisfaction, you can merge these changes by clicking the \"Merge pull request\" button.
Info
If the pull request has merge conflicts (e.g., if the branch you're merging into contains new commits), you will need to resolve these conflicts.
For further details about merging pull requests on GitHub, see the GitHub documentation.
"},{"location":"guides/collaborating/peer-code-review/","title":"Peer code review","text":"Once you're comfortable in using merge/pull requests to review changes in a branch, you can use this approach for peer code review.
Info
Remember that code review is a discussion and critique of a person's work. The code author will naturally feel that they own the code, and the reviewer needs to respect this.
For further advice and suggestions on how to conduct peer code review, please see the Performing peer code review resources.
Tip
Mention people who have reviewed your code in the acknowledgements section of the paper.
"},{"location":"guides/collaborating/peer-code-review/#define-the-goals-of-a-peer-review","title":"Define the goals of a peer review","text":"In creating a pull request and inviting someone to review your work, the pull request description should include the following details:
An overview of the work included in the pull request: what have you done, why have you done it?
You may also want to explain how this work fits into the broader context of your research project.
Identify specific questions or tasks that you would like the reviewer to address. For example, you might ask the reviewer to address one or more of the following questions:
Can the reviewer run your code and reproduce the outputs?
Is the code easy to understand?
If you have a style guide, is the code formatted appropriately?
Do the model equation or data analysis steps seem sensible?
If you have written documentation, is it easy to understand?
Can the reviewer suggest how to improve or rewrite a specific piece of code?
Tip
Make the reviewer's job easier by giving them small amounts of code to review.
"},{"location":"guides/collaborating/peer-code-review/#finding-a-reviewer","title":"Finding a reviewer","text":"On GitHub we have started a peer-review team. We encourage you to post on the discussion board, to find like-minded members to review your code.
"},{"location":"guides/collaborating/peer-code-review/#guidelines-for-reviewing-other-peoples-code","title":"Guidelines for reviewing other people's code","text":"Peer code review is an opportunity for the author and the reviewer to learn from each other and improve a piece of code.
Tip
The most important guideline for the reviewer is to be kind.
Treat other people's code the way you would want them to treat your code.
Avoid saying \"you\". Instead, say \"we\" or make the code the subject of the sentence.
Don't say \"You don't have a test for this function\", but instead say \"We should test this function\".
Don't say \"Why did you write it this way?\", but instead say \"What are the advantages of this approach?\".
Ask questions rather than stating criticisms.
Don't say \"This code is unclear\", but instead say \"Can you help me understand how this code works?\".
Treat peer review as an opportunity to praise good work!
Don't be afraid to tell the author that a piece of code was very clear, easy to understand, or well written.
Tell the author if reading their code made you aware of a useful function or package.
Tell the author if reading their code gave you some ideas for your own code.
Once the peer code review is complete, and any corresponding updates to the code have been made, you can merge the branch.
"},{"location":"guides/collaborating/peer-code-review/#retain-a-record-of-the-review","title":"Retain a record of the review","text":"By using merge/pull requests to review code, the discussion between the author and the reviewer is recorded. This can be a useful reference for future code reviews.
Tip
Try to record all of the discussion in the pull request comments, even if the author and reviewer meet in person, so that you have a complete record of the review.
"},{"location":"guides/collaborating/sharing-a-branch/","title":"Sharing a branch","text":"You might want a collaborator to work on a specific branch of your repository, so that you can keep their changes separate from your own work. Remember that you can merge commits from their branch into your own branches at any time.
Info
You need to ensure that your collaborator has access to the remote repository.
Create a new branch for the collaborator, and give it a descriptive name.
git checkout -b collab/jamie\n
In this example we created a branch called \"collab/jamie\", where \"collab\" is a prefix used to identify branches intended for collaborators, and the collaborator is called Jamie.
Remember that you can choose your own naming conventions.
Push this branch to your remote repository:
git push -u origin collab/jamie\n
Your collaborator can then make a local copy of this branch:
git clone --single-branch --branch collab/jamie repository-url\n
They can then create commits and push them to your remote repository with git push
.
The easiest way to share a repository with collaborators is to have a single remote repository that all collaborators can access. This repository could be located on a platform such as GitHub, GitLab, or Bitbucket, or on a platform provided by your University or Institute.
Theses platforms allow you to create public repositories and private repositories.
Everybody can view the contents of a public repository.
You control who can view the contents of a private repository.
For both types of repository, you control who can make changes to the repository, such as creating commits and branches.
Info
You should decide whether a public repository or a private repository suits you best.
"},{"location":"guides/collaborating/sharing-a-repository/#giving-collaborators-access-to-your-remote-repository","title":"Giving collaborators access to your remote repository","text":"The steps required to do this differ depending on which platform you are using. Here, we will describe how to give collaborators access to a repository on GitHub. For further details, see the GitHub documentation.
Open the main page of your GitHub repository.
Click on the \"Settings\" tab in the top navigation bar.
Click on the \"Collaborators\" item in the left sidebar.
Click on the \"Add people\" button.
Search for collaborators by entering their GitHub user name, their full name, or their email address.
Click the \"Add to this repository\" button.
This will send an invitation to the collaborator. If they accept this invitation, they will have access to your repository.
"},{"location":"guides/high-performance-computing/","title":"Cloud and HPC platforms","text":"This section introduces computing platforms that allow you to generate outputs more quickly, and without relying on your own laptop or desktop computer. It also demonstrates how to use version control to ensure that the code running on these platforms is the same as the code on your laptop.
"},{"location":"guides/project-structure/","title":"Project structure","text":"How we choose to structure a project can affect how readily someone else \u2014 or even yourself, after a period of absence \u2014 can understand, use, and extend the work.
Question
Have you ever looked at your old code and wondered how it worked or how to make it run?
Tip
A good project structure can serve as a table of contents and help the reader to navigate.
In an earlier section we provided some guidelines for how to structure a repository. In this section we present further guidelines and examples to help you choose a sensible structure for your current project and future projects.
This includes high-level recommendations that should apply to any project, and more detailed recommendations that may be specific to a particular type of project or choice of programming language.
"},{"location":"guides/project-structure/automating-tasks/","title":"Automate common tasks","text":"If you reach the point where you need to run a specific sequence of commands or actions to achieve something \u2014 e.g., running a model simulation, or producing an output figure \u2014 it is a very good idea to write a script that performs all of these actions correctly.
This is because while you may remember exactly what needs to be done right now, you may not remember next week, or next month, or next year. We're all human, and we all make mistakes, but these kinds of mistakes are easy to avoid!
Info
Mistakes are a fact of life. It is the response to the error that counts.
\u2014 Nikki Giovanni
There are many tools that can help you to automate tasks, some of which are smart enough that they will only do as little as possible (e.g., avoid re-running steps if the inputs have not changed).
There are popular tools aimed at specific programming languages, such as:
R: targets;
Python: nox and tox; and
Julia: pipelines.
There are many generic automation tools (see, e.g., Wikipedia's list of build automation software), although these can be rather complex to learn. We recommend using a language-specific automation tool where possible, and only using a generic automation tool as a last resort.
"},{"location":"guides/project-structure/exercise-a-good-readme/","title":"Exercise: a good README","text":"Remember that the README file (usually one of README.md
, README.rst
, or README.txt
) is often the very first thing that a user will see when looking at your project.
Have you seen any README files that were particularly helpful, or were not very helpful?
What information do you find helpful in a README file?
Consider the README.md
file in the Australian 2020 COVID-19 forecasts repository.
What content, if any, would you add to this file?
What content, if any, would you remove from this file?
Would you change its structure in any way?
Look back at your past projects and identify aspects of their structure that you have found helpful.
What features or choices have worked well in past projects and might help you structure your future projects?
What problems or issues have you experienced with the structure of your past projects, which you could avoid in your future projects?
Can any of your colleagues and collaborators share similar insights?
Once you've chosen a project structure, you need to write down how it all works \u2014 regardless of how simple and clear your project structure is!
Tip
The best place to do this is in a README.md
file (or equivalent) in the project root directory.
Begin with an overview of the project:
What question(s) are you trying to address?
What data, hypotheses, methods, etc, are you using?
What outputs does this generate?
You can then provide further detail, such as:
What software environment and/or packages must be available for your code to run?
How can the user generate each of the outputs?
What license have you chosen?
See the Australian 2020 COVID-19 forecasts repository for an example README.md
file.
This repository was used to generate the results, tables, and figures presented in the paper \"Forecasting COVID-19 activity in Australia to support pandemic response: May to October 2020\", Scientific Reports 13, 8763 (2023).
Strengths:
It includes installation and usage instructions;
It identifies the paper; and
It identifies the license under which the code is distributed.
Weaknesses:
It only explains some of the project structure.
It doesn't provide an overview of the project, it only links to the paper.
The root directory contains a number of scripts and input files that aren't described.
A good first step in deciding how to structure a project is to ask yourself:
What are the different project phases?
What are the major activities in each phase?
For example, a project might involve the following phases:
Clean an existing data set;
Build models with different hypotheses or features;
Fit each model to the data; and
Decide which model best explains the data.
The data-cleaning phase might involve the following activities:
Obtain the raw data;
Identify the quality checks that should be applied;
Decide how to resolve data that fail each quality check; and
Generate and record the cleaned data.
The model-building phase might involve the following activities:
Perform a literature search to identify relevant modelling studies;
Identify competing hypotheses or features that might explain the data;
Design a model that implements each hypothesis; and
Define the relationship between each model and the cleaned data.
You can use the phases and activities to guide your choice of directory structure. For this example project, one possible structure is:
project/
: the root directory of your project
input/
: a sub-directory that contains input data;
raw/
: the raw data before cleaning;
cleaned/
: the cleaned data;
code/
: a sub-directory that contains the project code;
cleaning/
: the data cleaning code;
model-first-hypothesis/
: the first model;
model-second-hypothesis/
: the second model;
fitting/
: the code that fits each model to the data;
evaluation/
: the code the compares the model fits;
plotting/
: the code that plots output figures;
paper/
: a sub-directory for the project manuscript;
figures/
: the output figures;This section demonstrates how to use version control and software testing to ensure that your research results can be independently reproduced by others.
Tip
Reproducibility is just as much about simple work habits as the tools used to share data and code.
\u2014 Jesse M.\u00a0Alston and Jessica A.\u00a0Rick
"},{"location":"guides/reproducibility/what-is-reproducible-research/","title":"What is reproducible research?","text":"Various scientific disciplines have defined and used the terms \"replication\" and \"reproducible\" in different (and sometimes contradictory) ways. But in recent years there have been efforts to standardise these terms, particularly in the context of computational research. Here we will use the definitions from Reproducibility and Replicability in Science:
Replication: obtaining consistent results across studies aimed at answering the same scientific question, each of which has obtained its own data.
Reproducible: obtaining consistent computational results using the same input data, computational steps, methods, code, and conditions of analysis.
Question
If you can't explain your research well enough for someone else to reproduce the results, can you really call what you did \"research\"?
Creating reproducible research requires good work habits and practices, and also a way to verify that the results are reproducible.
It is sometimes said that reproducibility can be achieved by sharing all of the code and data with a published article, but this is not necessarily sufficient for computationally complex studies. There are many reasons why running the same code with the same data may produce different results.
Tip
It is easier to make our research reproducible by starting at the planning stage, rather than waiting until we've produced results.
"},{"location":"guides/testing/","title":"Testing","text":"This section introduces the topic of software testing. Testing your code is an important part of any code-based research activity. Tests check whether your code behaves as intended, and can warn you if you introduce a bug or mistake into your code.
Tip
Tests can show the presence of bugs, but not their absence.
\u2014 Edsger W.\u00a0Dijkstra
"},{"location":"guides/using-git/","title":"Effective use of git","text":"This section shows how to use the git
command-line program to record your work, to inspect your commit history, and to search this commit history to identify commits that make specific changes or have specific effects.
Reminder
Remember to commit early and commit often. Do not wait until your code is \"perfect\".
"},{"location":"guides/using-git/choosing-a-license/","title":"Choosing a license","text":"A license specifies the conditions under which others may use, modify, and/or distribute your work.
Info
Simply making a repository publicly accessible is not sufficient to allow others to make use of your work. Unless you include a license that specifies otherwise, nobody else can copy, distribute, or modify your work.
There are many different types of licenses that you can use, and the number of options can seem overwhelming. But it is usually straightforward to narrow down your options.
If you're working on an existing project, the easiest option is to use that project's license.
If you're working with an existing community, they may have a preferred license.
If you want to choose an open source license, the Choose an open source license website provides advice for selecting a license that meets your needs.
For further information about the various types of available licenses, and some advice for selecting a suitable license for academic software, see A Quick Guide to Software Licensing for the Scientist-Programmer.
"},{"location":"guides/using-git/choosing-your-git-editor/","title":"Choosing your Git editor","text":"In this video, we show how to use nano and vim for writing commit messages. See below for brief instructions on how to use these editors.
Tip
This editor is only used for writing commit messages. It is entirely separate from your choice of editor for any other task, such as writing code.
Git editor exampleVideo timeline:
Note
You can pause the video to select and copy any of the text, such as the git config --global core.editor
commands.
Once you have written your commit message, press Ctrl + O and then Enter to save the commit message, then press Ctrl + X to quit the editor.
To quit without saving press Ctrl + X. If you have made any changes, nano
will ask if you want to save them. Press n to quit without saving these changes.
You need to press i (switch to insert mode) before you can write your commit message. Once you have written your commit message, press Esc and then type :wq to save your changes and quit the editor.
To quit without saving press Esc and then type :q!.
"},{"location":"guides/using-git/cloning-an-existing-repository/","title":"Cloning an existing repository","text":"If there is an existing repository that you want to work on, you can \"clone\" the repository and have a local copy. To do this, you need to know the remote repository's URL.
Tip
For GitHub repositories, there should be a green button labelled \"Code\". Click on this button, and it will provide you with the URL.
You can then make a local copy of the repository by running:
git clone URL\n
For example, to make a local copy of this book, run the following command:
git clone https://github.com/robmoss/git-is-my-lab-book.git\n
This will create a local copy in the directory git-is-my-lab-book
.
Note
If you have a GitHub account and have set up an SSH key, you can clone GitHub repositories using your SSH key. This will allow you to push commits to the remote repository (if you are permitted to do so) without having to enter your user name and password.
You can obtain the SSH URL from GitHub by clicking on the green \"Code\" button, and selecting the \"SSH\" tab.
For example, to make a local copy of this book using SSH, run the following command:
git clone git@github.com:robmoss/git-is-my-lab-book.git\n
"},{"location":"guides/using-git/creating-a-commit/","title":"Creating a commit","text":"Creating a commit involves two steps:
Identify the changes that should be included in the commit. These changes are then \"staged\" and ready to be included in the next commit.
Create a new commit that records these staged changes. This should be accompanied by a useful commit message.
We will now show how to perform these steps.
Note
At any time, you can see a summary of the changes in your repository, and which ones are staged to be committed, by running:
git status\n
This will show you:
If you've created a new file, you can include this file in the next commit by running:
git add filename\n
"},{"location":"guides/using-git/creating-a-commit/#adding-all-changes-in-an-existing-file","title":"Adding all changes in an existing file","text":"If you've made changes to an existing file, you can include all of these changes in the next commit by running:
git add filename\n
"},{"location":"guides/using-git/creating-a-commit/#adding-some-changes-in-an-existing-file","title":"Adding some changes in an existing file","text":"If you've made changes to an existing file and only want to include some of these changes in the next commit, you can select the changes to include by running:
git add -p filename\n
This will show you each of the changes in turn, and allow you select which ones to stage.
Tip
This interactive selection mode is very flexible; you can enter ?
at any of the prompts to see the range of available actions.
If you want to rename a file, you can use git mv
to rename the file and stage this change for inclusion in the next commit:
git mv filename newname\n
"},{"location":"guides/using-git/creating-a-commit/#removing-a-file","title":"Removing a file","text":"If you want to remove a file, you can use git rm
to remove the file and stage this change for inclusion in the next commit:
git rm filename\n
Tip
If the file has any uncommitted changes, git will refuse to remove the file. You can override this behaviour by running:
git rm --force filename\n
"},{"location":"guides/using-git/creating-a-commit/#inspecting-the-staged-changes","title":"Inspecting the staged changes","text":"To verify that you have staged all of the desired changes, you can view the staged changes by running:
git diff --cached\n
You can view the staged changes for a specific file by running:
git diff --cached filename\n
"},{"location":"guides/using-git/creating-a-commit/#undoing-a-staged-change","title":"Undoing a staged change","text":"You may sometimes stage a change for inclusion in the next commit, but decide later that you don't want to include it in the next commit. You can undo staged changes to a file by running:
git restore --staged filename\n
Note
This will not modify the contents of the file.
"},{"location":"guides/using-git/creating-a-commit/#creating-a-new-commit","title":"Creating a new commit","text":"Once you have staged all of the changes that you want to include in the commit, create the commit by running:
git commit\n
This will open your chosen editor and prompt you to write the commit message.
Tip
Note that the commit will not be created until you exit the editor.
If you decide that you don't want to create the commit, you can abort this action by closing your editor without saving a commit message.
Please see Choosing your Git editor for details.
"},{"location":"guides/using-git/creating-a-commit/#modifying-the-most-recent-commit","title":"Modifying the most recent commit","text":"After you create a commit, you might decide that there are other changes that should be included in the commit. Git provides a simple way of modifying the most recent commit.
Warning
Do not modify the commit if you have already pushed it to another repository. Instead, record a new commit that includes the desired changes.
Remember that your commit history should not be a highly-edited, polished view of your work, but should instead act as a lab book.
Do not worry about creating \"perfect\" commits!
To modify the most recent commit, stage the changes that you want to commit (see the sections above) and add them to the most recent commit by running:
git commit --amend\n
This will open your chosen editor and allow you to modify the commit message.
"},{"location":"guides/using-git/creating-a-remote-repository/","title":"Creating a remote repository","text":"Once you have created a \"local\" repository (i.e., a repository that exists on your own computer), it is generally a good idea to create a \"remote\" repository. You may choose to store this remote repository on a service such as GitHub, or on a University-provided platform.
If you are using GitHub, you can choose to create a public repository (viewable by anyone, but you control who can make changes) or a private repository (you control who can view and/or make changes).
"},{"location":"guides/using-git/creating-a-remote-repository/#linking-your-local-and-remote-repositories","title":"Linking your local and remote repositories","text":"Once you have created the remote repository, you need to link it to your local repository. This will allow you to \"push\" commits from your local repository to the remote repository, and to \"pull\" commits from the remote repository to your local repository.
Note
When you create a new repository on services such as GitHub, they will give you instructions on how to link this new repository to your local repository. We also provide an example, below.
A repository can be linked to more than one remote repository, so we need to choose a name to identify this remote repository.
Info
The name \"origin\" is commonly used to identify the main remote repository.
In this example, we link our local repository to the remote repository for this book (https://github.com/robmoss/git-is-my-lab-book
) with the following command:
git remote add origin git@github.com:robmoss/git-is-my-lab-book.git\n
Note
Notice that the URL is similar to, but not identical to, the URL you use to view the repository in your web browser.
"},{"location":"guides/using-git/creating-a-repository/","title":"Creating a repository","text":"You can create repositories by running git init
. This will create a .git
directory that will contain all of the repository information.
There are two common ways to use git init
:
Create an empty repository in the current directory, by running:
git init\n
Create an empty repository in a specific directory, by running:
git init path/to/repository\n
Info
Git will create the repository directory if it does not exist.
In this exercise you will create a local repository, and use this repository to create multiple commits, switch between branches, and inspect the repository history.
Create a new, empty repository in a directory called git-exercise
.
Create a README.md
file and write a brief description for this repository. Record the contents of README.md
in a new commit, and write a commit message.
Write a script that generates a small data set, and saves the data to a CSV file. For example, this script could sample values from a probability distribution with fixed shape parameters. Explain how to use this script in README.md
. Record your changes in a new commit.
Write a script that plots these data, and saves the figure in a suitable file format. Explain how to use this script in README.md
. Record your changes in a new commit.
Add a tag milestone-1
to the commit you created in the previous step.
Create a new branch called feature/new-data
. Check out this branch and modify the data-generation script so that it produces new data and/or more data. Record your changes in one or more new commits.
Create a new branch called feature/summarise
from the tag you created in step #5. Check out this branch and modify the plotting script so that it also prints some summary statistics of the data. Record your changes in one or more new commits.
In your main
or master
branch, and add a license. Record your changes in a new commit.
In your main
or master
branch, merge the two feature branches created in steps #6 and #7, and add a new tag milestone-2
.
Now that you have started a repository, created commits in multiple branches, and merged these branches, here are some questions for you to consider:
Have you committed the generated data file and/or the plot figure?
If you haven't committed either or both of these files, have you instructed git
to ignore them?
Did you add a meaningful description to each milestone tag?
How many commits modified your data-generation script?
How many commits modified your plotting script?
What changes, if any, were made to README.md
since it was first created?
Tip
To answer some of these questions, you may need to run git
commands.
We have created a public repository that you can use to try resolving a merge conflict yourself. This repository includes some example data and a script that performs some basic data analysis.
First, obtain a local copy (a \"clone\") of this repository by running:
git clone https://github.com/robmoss/gimlb-simple-merge-example.git\ncd gimlb-simple-merge-example\n
"},{"location":"guides/using-git/exercise-resolve-a-merge-conflict/#the-repository-history","title":"The repository history","text":"You can inspect the repository history by running git log
. Some key details to notice are:
README.md
LICENSE
analysis/initial_exploration.R
input_data/data.csv
The second commit created the following file:
outputs/summary.csv
This commit has been given the tag first_milestone
.
From this first_milestone
tag, two branches were created:
The feature/second-data-set
branch adds a second data set and updates the analysis script to inspect both data sets.
The feature/calculate-rate-of-change
branch changes which summary statistics are calculated for the original data set.
The example-solution
branch merges both feature branches and resolves any merge conflicts. This branch has been given the tag second_milestone
.
You will start with the master
branch, which contains the commits up to the first_milestone
tag, and then merge the two feature branches into this branch, resolving any merge conflicts that arise. You can then compare your results to the example-solution
branch.
Obtain a local copy of this repository, by running:
git clone https://github.com/robmoss/gimlb-simple-merge-example.git\ncd gimlb-simple-merge-example\n
Create local copies of the two feature branches and the example solution, by running:
git checkout feature/second-data-set\ngit checkout feature/calculate-rate-of-change\ngit checkout example-solution\n
Return to the master
branch, by running:
git checkout master\n
Merge the feature/second-data-set
branch into master
, by running:
git merge feature/second-data-set\n
Merge the feature/calculate-rate-of-change
branch into master
, by running:
git merge feature/calculate-rate-of-change\n
This will result in a merge conflict, and now you need to decide how to resolve each conflict! Once you have resolved the conflicts, create a commit that records all of your changes (see the previous chapter for an example).
Tip
You may find it helpful to inspect the commits in each of the feature branches to understand how they have changed the files in which the conflicts have occurred.
"},{"location":"guides/using-git/exercise-resolve-a-merge-conflict/#self-evaluation","title":"Self evaluation","text":"Once you have created a commit that resolves these conflicts, see how similar or different the contents of your commit are to the corresponding commit in the example-solution
branch (which has been tagged second_milestone
). You can inspect this commit by running:
git show example-solution\n
You can compare this commit to your solution by running:
git diff example-solution\n
How does your resolution compare to this commit?
Note
You may have resolved the conflicts differently to the example-solution
branch, and that's perfectly fine as long as they have the same effect.
Here we present a recorded terminal session in which we clone this repository and resolve the merge conflict.
Tip
You can use the video timeline (below) to jump to specific moments in this exercise. Remember that you can pause the recording at any point to select and copy any of the text.
Resolving a merge conflictVideo timeline:
feature/second-data-set
branchfeature/calculate-rate-of-change
branchfeature/second-data-set
branchfeature/calculate-rate-of-change
branchIn this exercise, you will use a remote repository to synchronise and merge changes between multiple local repositories, starting from the local git-exercise
repository that you created in the previous exercise.
Create a new remote repository on a platform such as GitHub. You can make this a private repository, because you won't need to share it with anyone.
Link your local git-exercise
repository to this remote repository, and push all branches and tags to this remote repository.
Make a local copy of this remote repository called git-exercise-2
.
Check out the main
or master
branch. The files should be identical to the milestone-2
tag in your original git-exercise
repository.
Create a new branch called feature/report
. Check out this branch and create a new file called report.md
. Edit this file so that it contains:
A brief description of the generated data set;
Record your changes in a new commit.
In your original git-exercise
repository, checkout the feature/report
branch from the remote repository and verify that it now contains the file report.md
.
Merge this branch into your main
or master
branch, and add a new tag milestone-3-report
.
Push the updated main
or master
branch to the remote repository.
git-exercise-2
repository, checkout the main
or master
branch and pull changes from the remote repository. It should now contain the file report.md
.Info
Congratulations! You have used a remote repository to synchronise and merge changes between two local repositories. You can use this workflow to collaborate with colleagues.
"},{"location":"guides/using-git/exercise-use-a-remote-repository/#self-evaluation","title":"Self evaluation","text":"Now that you have used commits and branches to share work between multiple repositories, here are some questions for you to consider:
Do you feel comfortable in deciding which changes to record in a single commit?
Do you feel that your commit messages help describe the changes that you have made in this repository?
Do you feel comfortable in using multiple branches to work on separate ideas in parallel?
Do you have any current projects that you might want to work on using local and remote git
repositories?
Once you've installed Git, you should define some important settings before you starting using Git.
Info
We assume that you will want to set the git configuration for all repositories owned by your user. Therefore, we use the --global
flag. Configuration files can be set for a single repository or the whole computer by replacing --global
with --local
or --system
respectively.
Define your user name and email address. These details are included in every commit that you create.
git config --global user.name \"My Name\"\ngit config --global user.email \"my.name@some-university.edu.au\"\n
2. Define the text editor that Git should use for tasks such as writing commit messages: git config --global core.editor editor-name\n
NOTE: on Windows you need to specify the full path to the editor:
git config --global core.editor \"C:/Program Files/My Editor/editor.exe\"\n
Tip
Please see Choosing your Git editor for details.
By default, Git will create a branch called master
when you create a new repository. You can set a different name for this initial branch:
git config --global init.defaultBranch main\n
Ensure that repository histories always record when branches were merged:
git config --global merge.ff no\n
This prevents Git from \"fast-forwarding\" when the destination branch contains no new commits. For example, it ensures that when you merge the green branch into the blue branch (as shown below) it records that commits D, E, and F came from the green branch.
Adjust how Git shows merge conflicts:
git config --global merge.conflictstyle diff3\n
This will be useful when we look at how to use branches and how to resolve merge conflicts.
Info
If you use Windows, there are tools that can improve your Git experience in PowerShell.
There are also tools for integrating Git into many common text editors. See Git in other environments, Appendix A of the Pro Git book.
"},{"location":"guides/using-git/graphical-git-clients/","title":"Graphical Git clients","text":"In this book we will primarily show how to use Git from the command-line. If you don't have Git already installed on your computer, see these instructions for installing Git.
In addition to using the command-line, there are other ways to work with Git repositories:
There are many graphical clients that you can download and use;
Many editors include built-in Git support (e.g., Atom, RStudio, Visual Studio Code); and
Online platforms such as GitHub, GitLab, and Bitbucket also provide a graphical interface for common Git actions.
All of the concepts and terminology you will learn in this book should also apply to all of the tools listed above.
"},{"location":"guides/using-git/how-to-create-and-use-tags/","title":"How to create and use tags","text":"Tags allow you to bookmark important points in your commit history.
You can use tags to identify milestones such as:
feature-age-dependent-mixing
);objective-1
, objective-2
);draft-1
, draft-2
); andsubmitted
, revision-1
).You can add a tag (in this example, \"my-tag\") to the current commit by running:
git tag -a my-tag\n
This will open your chosen editor and ask you to write a description for this tag.
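If you prefer to provide the description directly on the command line, you can use the -m option; for example:
git tag -a my-tag -m \"Initial draft of the manuscript\"\n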
"},{"location":"guides/using-git/how-to-create-and-use-tags/#pushing-tags-to-a-remote-repository","title":"Pushing tags to a remote repository","text":"By default, git push
doesn't push tags to remote repositories. Instead, you have to explicitly push tags. You can push a tag (in this example, called \"my-tag\") to a remote repository (in this example, called \"origin\") by running:
git push origin my-tag\n
You can push all of your tags to a remote repository (in this example, called \"origin\") by running:
git push origin --tags\n
"},{"location":"guides/using-git/how-to-create-and-use-tags/#tagging-a-past-commit","title":"Tagging a past commit","text":"To add a tag to a previous commit, you can identify the commit by its hash. For example, you can inspect your commit history by running:
git log --oneline --no-decorate\n
If your commit history looks like:
003cf6b Show how to ignore certain files\n339eb5a Show how to prepare and record commits\n6a7fb8b Show how to clone remote repositories\n...\n
where the current commit is 003cf6b
(\"Show how to ignore certain files\"), you can tag the previous commit (\"Show how to prepare and record commits\") by running: git tag -a my-tag 339eb5a\n
"},{"location":"guides/using-git/how-to-create-and-use-tags/#listing-tags","title":"Listing tags","text":"You can list all tags by running:
git tag\n
You can also list only tags that match a specific pattern (in this example, all tags beginning with \"my\") by running:
git tag --list 'my*'\n
"},{"location":"guides/using-git/how-to-create-and-use-tags/#deleting-tags","title":"Deleting tags","text":"You can delete a tag by running:
git tag --delete my-tag\n
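Note that this only deletes the tag from your local repository. If you have already pushed the tag to a remote repository (in this example, called \"origin\"), you can delete it there by running:
git push origin --delete my-tag\n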
"},{"location":"guides/using-git/how-to-create-and-use-tags/#creating-a-branch-from-a-tag","title":"Creating a branch from a tag","text":"You can check out a tag and begin working on a new branch by running:
git checkout -b my-branch my-tag\n
"},{"location":"guides/using-git/how-to-ignore-certain-files/","title":"How to ignore certain files","text":"Your repository may contain files that you don't want to include in your commit history. For example, you may not want to include files of the following types:
.aux
files, which are generated when compiling LaTeX documents; and.pyc
files, which are generated when running Python code..pdf
versions of LaTeX documents; andYou can instruct Git to ignore certain files by creating a .gitignore
file. This is a plain text file, where each line defines a pattern that identifies files and directories which should be ignored. You can also add comments, which must start with a #
, to explain the purpose of these patterns.
Tip
If your editor will not accept .gitignore
as a file name, you can create a .gitignore
file in your repository by running:
touch .gitignore\n
For example, the following .gitignore
file would make Git ignore all .aux
and .pyc
files, and the file my-paper.pdf
:
# Ignore all .aux files generated by LaTeX.\n*.aux\n# Ignore all byte-code files generated by Python.\n*.pyc\n# Ignore the PDF version of my paper.\nmy-paper.pdf\n
If you have sensitive data files, one option is to store them all in a dedicated directory and add this directory to your .gitignore
file:
# Ignore all data files in the \"sensitive-data\" directory.\nsensitive-data\n
Tip
You can force Git to add an ignored file to a commit by running:
git add --force my-paper.pdf\n
But it would generally be better to update your .gitignore
file so that it stops ignoring these files.
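If you are not sure why a particular file is being ignored, you can ask Git to show the matching pattern (and the .gitignore file that defines it) by running:
git check-ignore -v my-paper.pdf\n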
A merge conflict can occur when we try to merge one branch into another, if the two branches introduce any conflicting changes.
For example, consider trying to merge two branches that make the following changes to the same line of the file test.txt
:
On the branch my-new-branch
:
First line\n-Second line\n+My new second line\n Third line\n
On the main branch:
First line\n-Second line\n+A different second line\n Third line\n
When we attempt to merge my-new-branch
into the main branch, git merge my-new-branch
will tell us:
Auto-merging test.txt\nCONFLICT (content): Merge conflict in test.txt\nAutomatic merge failed; fix conflicts and then commit the result.\n
The test.txt
file will now include the conflicting changes, which we can inspect with git diff
:
diff --cc test.txt\nindex 18712c4,bc576a6..0000000\n--- a/test.txt\n+++ b/test.txt\n@@@ -1,3 -1,3 +1,7 @@@\n First line\n++<<<<<<< ours\n +A different second line\n++=======\n+ My new second line\n++>>>>>>> theirs\n Third line\n
Note that this two-way diff shows both \"our\" changes and \"their\" changes.
Each conflict is surrounded by <<<<<<<
and >>>>>>>
markers, and the conflicting changes are separated by a =======
marker.
If we instruct Git to use a three-way diff (see first-time Git setup), the conflict will be reported slightly differently:
diff --cc test.txt\nindex 18712c4,bc576a6..0000000\n--- a/test.txt\n+++ b/test.txt\n@@@ -1,3 -1,3 +1,7 @@@\n First line\n++<<<<<<< ours\n +A different second line\n++||||||| base\n++Second line\n++=======\n+ My new second line\n++>>>>>>> theirs\n Third line\n
In addition to showing \"our\" changes and \"their\" changes, this three-way diff also shows the original lines, between the |||||||
and =======
markers. This extra information can help you decide how to best resolve the conflict.
We can edit test.txt
to reconcile these changes, and then commit our fix. For example, we might decide that test.txt
should have the following contents:
First line\nThe corrected second line\nThird line\n
We can then commit these changes to resolve the merge conflict:
git add test.txt\ngit commit -m \"Resolved the merge conflict\"\n
"},{"location":"guides/using-git/how-to-resolve-merge-conflicts/#cancelling-the-merge","title":"Cancelling the merge","text":"Alternatively, you may decide you don't want to merge these two branches, in which case you cancel the merge by running:
git merge --abort\n
"},{"location":"guides/using-git/how-to-structure-a-repository/","title":"How to structure a repository","text":"While there is no single \"best\" way to structure a repository, there are some guidelines that you can follow. The key aims are to ensure that your files are logically organised, and that others can easily navigate the repository.
"},{"location":"guides/using-git/how-to-structure-a-repository/#divide-your-repository-into-multiple-directories","title":"Divide your repository into multiple directories","text":"It is generally a good idea to have separate directories for different types of files. For example, your repository might contain any of these different file types, and you should at least consider storing each of them in a separate directory:
Choosing file names that indicate what each file/directory contains can help other people, such as your collaborators, navigate your repository. They can also help you when you return to a project after several weeks or months.
Tip
Have you ever asked yourself \"where is the file that contains X\"?
Use descriptive file names, and the answer might be right in front of you!
"},{"location":"guides/using-git/how-to-structure-a-repository/#include-a-readme-file","title":"Include aREADME
file","text":"You can write this in Markdown (README.md
), in plain text (README
or README.txt
), or in any other suitable format. For example, Python projects often use reStructuredText and have a README.rst
file.
This file should begin with a brief description of why the repository was created and what it contains.
Importantly, this file should also mention:
How the files and directories are arranged. Help your collaborators understand where they need to look in order to find something.
How to run important pieces of code (e.g., to generate output data files or figures).
The software packages and/or libraries that are required run any of the code in this repository.
The license (if any) under which the repository contents are being made available.
Recall that branches allow you to work on different ideas or tasks in parallel, within a single repository. In this chapter, we will show you how create and use branches. In the Collaborating section, we will show you how branches can allow multiple people to work together on code and papers, and how you can use branches for peer code review.
Info
Branches, like tags, are identified by name. Common naming conventions include:
feature/some-new-thing
for adding something new (a new data analysis, a new model feature, etc); andbugfix/some-problem
for fixing something that isn't working as intended (e.g., perhaps there's a mistake in a data analysis script).You can choose your own conventions, but make sure that you choose meaningful names.
Do not use names like branch1
, branch2
, etc.
You can create a new branch (in this example, called \"my-new-branch\") that starts from the current commit by running:
git checkout -b my-new-branch\n
You can also create a new branch that starts from a specific commit, tag, or branch in your repository:
git checkout -b my-new-branch 95eaae5 # From an existing commit\ngit checkout -b my-new-branch my-tag-name # From an existing tag\ngit checkout -b my-new-branch my-other-branch # From an existing branch\n
You can then create a corresponding upstream branch in your remote repository (in this example, called \"origin\") by running:
git push -u origin my-new-branch\n
"},{"location":"guides/using-git/how-to-use-branches/#working-on-a-remote-branch","title":"Working on a remote branch","text":"If there is a branch in your remote repository that you want to work on, you can make a local copy by running:
git checkout remote-branch-name\n
This will create a local branch with the same name (in this example, \"remote-branch-name\").
"},{"location":"guides/using-git/how-to-use-branches/#listing-branches","title":"Listing branches","text":"You can list all of the branches in your repository by running:
git branch\n
This will also highlight the current branch.
"},{"location":"guides/using-git/how-to-use-branches/#switching-between-branches","title":"Switching between branches","text":"You can switch from your current branch to another branch (in this example, called \"other-branch\") by running:
git checkout other-branch\n
Info
Git will not let you switch branches if you have any uncommitted changes.
One way to avoid this issue is to record the current changes as a new commit, and explain in the commit message that this is a snapshot of work in progress.
A second option is to discard the uncommitted changes to each file by running:
git restore file1 file2 file3 ... fileN\n
"},{"location":"guides/using-git/how-to-use-branches/#pushing-and-pulling-commits","title":"Pushing and pulling commits","text":"Once you have created a branch, you can use git push
to \"push\" your commits to the remote repository, and git pull
to \"pull\" commits from the remote repository. See Pushing and pulling commits for details.
You can use git log
to inspect the commit history of any branch:
git log branch-name\n
Remember that there are many ways to control what git log
will show you.
Similarly, you can use git diff
to compare the changes in any two branches:
git diff first-branch second-branch\n
Again, there are ways to control what git diff
will show you.
You may reach a point where you want to incorporate the changes from one branch into another branch. This is referred to as \"merging\" one branch into another, and is illustrated in the What is a branch? chapter.
For example, you might have completed a new feature for your model or data analysis, and now want to merge this back into your main branch.
First, ensure that the current branch is the branch you want to merge the changes into (this is often your main or master branch). You can them merge the changes from another branch (in this example, called \"other-branch\") by running:
git merge other-branch\n
This can have two different results:
The commits from other-branch
were merged successfully into the current branch; or
There were conflicting changes (referred to as a \"merge conflict\").
In the next chapter we will show you how to resolve merge conflicts.
"},{"location":"guides/using-git/inspecting-your-history/","title":"Inspecting your history","text":"You can inspect your commit history at any time with the git log
command. By default, this command will list every commit from the very first commit to the current commit, and for each commit it will show you:
There are many ways to adjust which commits and what details that git log
will show.
Tip
Each commit has a unique identifier (\"hash\"). These hashes are quite long, but in general you only need to provide the first 5-7 digits to uniquely identify a specific commit.
"},{"location":"guides/using-git/inspecting-your-history/#listing-commits-over-a-specific-time-interval","title":"Listing commits over a specific time interval","text":"You can limit which commits git log
will show by specifying a start time and/or an end time.
Tip
This can be extremely useful for generating progress reports and summarising your recent activity in team meetings.
For example, you can view commits from the past week by running:
git log --since='7 days'\ngit log --since='1 week'\n
You can view commits made between 1 and 2 weeks ago by running:
git log --since='2 weeks' --until='1 week'\n
You can view commits made between specific dates by running:
git log --since='2022/05/12' --until='2022/05/14'\n
"},{"location":"guides/using-git/inspecting-your-history/#listing-commits-that-modify-a-specific-file","title":"Listing commits that modify a specific file","text":"You can see which commits have made changes to a file by running:
git log -- filename\n
Info
Note the --
argument that comes before the file name. This ensures that if the file name begins with a -
, git log
will not treat the file name as an option.
You can make git log
display only the first 7 digits of each commit hash, and the first line of each commit message, by running:
git log --oneline\n
This can be a useful way to get a quick overview of the recent history.
"},{"location":"guides/using-git/inspecting-your-history/#viewing-the-contents-of-a-single-commit","title":"Viewing the contents of a single commit","text":"You can identify a commit by its unique identifier (\"hash\") or by its tag name (if it has been tagged), and view the commit with git show
:
git show commit-hash\ngit show tag-name\n
This will show the commit details and all of the changes that were recorded in this commit.
Tip
By default, git show
will show you the most recent commit.
You can view all of the changes that were made between two commits with the git diff
command.
Tip
The git diff
command shows the difference between two points in your commit history.
Note that git diff
does not support start and/or end times like git log
does; you must use commit identifiers.
For example, here is a subset of the commit history for this book's repository:
95eaae5 Note the need for a GitHub account and SSH key\n11085f0 Show how to create a branch from a tag\n9369482 Show how to create and use tags\n003cf6b Show how to ignore certain files\n339eb5a Show how to prepare and record commits\n6a7fb8b Show how to clone remote repositories\n6a49e10 Note that mdbook-admonish must be installed\na8e6114 Fixed the URL for the UoM GitLab instance\n5192704 Add a merge conflict exercise\n
We can view all of the changes that were made after the bottom commit (5192704
, \"Add a merge conflict exercise\") up to and including the top commit (95eaae5
, \"Note the need for a GitHub account and SSH key\") by running:
git diff 5192704..95eaae5\n
In the above example, 8 files were changed, with a total of 310 new lines and 7 deleted lines. This is a lot of information! You can print a summary of these changes by running:
git diff --stat 5192704..95eaae5\n
This should show you the following details:
README.md | 2 +-\n src/SUMMARY.md | 3 +\n src/prerequisites.md | 2 +\n src/using-git/cloning-an-existing-repository.md | 36 ++++++++++\n src/using-git/creating-a-commit.md | 146 +++++++++++++++++++++++++++++++++++++--\n src/using-git/how-to-create-and-use-tags.md | 89 ++++++++++++++++++++++++\n src/using-git/how-to-ignore-certain-files.md | 37 ++++++++++\n src/version-control/what-is-a-repository.md | 2 +-\n 8 files changed, 310 insertions(+), 7 deletions(-)\n
This reveals that about half of the changes (146 new/deleted lines) were made to src/using-git/creating-a-commit.md
.
Similar to the git log
command, you can limit the files that the git diff
command will examine. For example, you can display only the changes made to README.md
in the above example by running:
git diff 5192704..95eaae5 -- README.md\n
This should show you the following change:
diff --git a/README.md b/README.md\nindex 7956b65..a34f907 100644\n--- a/README.md\n+++ b/README.md\n@@ -15,7 +15,7 @@ This work is licensed under a [Creative Commons Attribution-ShareAlike 4.0 Inter\n\n ## Building the book\n\n-You can build this book by installing [mdBook](https://rust-lang.github.io/mdBook/) and running the following command in this directory:\n+You can build this book by installing [mdBook](https://rust-lang.github.io/mdBook/) and [mdbook-admonish](https://github.com/tommilligan/mdbook-admonish/), and running the following command in this directory:\n\n ```shell\n mdbook build\n
"},{"location":"guides/using-git/pushing-and-pulling-commits/","title":"Pushing and pulling commits","text":"In general, we \"push\" commits from our local repository to a remote repository by running:
git push <remote-repository>\n
and \"pull\" commits from a remote repository into our local repository by running:
git pull <remote-repository>\n
where <remote-repository>
is either a URL or the name of a remote repository.
However, we generally want to push to, and pull from, the same remote repository every time. See the next section for an example of linking the main branch in your local repository with a corresponding \"upstream\" branch in your remote repository.
"},{"location":"guides/using-git/pushing-and-pulling-commits/#pushing-your-first-commit-to-a-remote-repository","title":"Pushing your first commit to a remote repository","text":"In order to push commits from your local repository to a remote repository, we need to create a branch in the remote repository that corresponds to the main branch of our local repository. This requires that you have created at least one commit in your local repository.
Tip
This is a good time to create a README.md
file and write a brief description of what this repository will contain.
Once you have at least one commit in your local repository, you can create a corresponding upstream branch in the remote repository with the following command:
git push -u origin <branch-name>\n
The default branch will probably be called \"main\"
or \"master\"
, depending on your Git settings. You can identify the branch name by running:
git branch\n
Note
Recall that we identify remote repositories by name. In this example, the remote repository is call \"origin\". You can choose a different name when linking your local and remote repositories.
Once you have defined the upstream branch, you can push commits by running:
git push\n
and pull commits by running:
git pull\n
without having to specify the remote repository or branch name.
"},{"location":"guides/using-git/pushing-and-pulling-commits/#forcing-updates-to-a-remote-repository","title":"Forcing updates to a remote repository","text":"By default, Git will refuse to push commits from a local branch to a remote branch if the remote branch contains any commits that are not in your local branch. This situation should not arise in general, and typically indicates that either someone else has pushed new commits to the remote branch (see the Collaborating section) or that you have altered the history of your local branch.
If you are absolutely confident that your local history of commits should replace the contents of the remote branch, you can force this update by running:
git push --force\n
Tip
Unless you are confident that you understand why this situation has occurred, it is probably a good idea to ask for advice before running the above command.
"},{"location":"guides/using-git/where-did-this-line-come-from/","title":"Where did this line come from?","text":"Consider the What should I commit? chapter. Imagine that we want to know when and why the following text was added:
A helpful guideline is \"**commit early, commit often**\".\n
If we can identify the relevant commit, we can then inspect the commit (using git show <commit>
) to see all of the changes that it introduced. Ideally, the commit message will explain the reasons why this commit was made. This is one way in which your commit messages can act as a lab book.
At the time of writing (commit 2a96324
), the contents of the What should I commit? came from two commits:
git log --oneline src/version-control/what-should-I-commit.md\n
3dfff1f Add notes about committing early and often\n9be780b Briefly describe key version control concepts\n
We can use the git blame
command to identify the commit that last modified each line in this file:
git blame -s src/version-control/what-should-I-commit.md\n
9be780b8 1) # What should I commit?\n9be780b8 2)\n9be780b8 3) A commit should represent a **unit of work**.\n9be780b8 4)\n9be780b8 5) If you've made changes that represent multiple units of work (e.g., changing how input data are processed, and adding a new model parameter) these should be saved as separate commits.\n9be780b8 6)\n9be780b8 7) Try describing out loud the changes you have made, and if you find yourself saying something like \"I did X and Y and Z\", then the changes should probably divided into multiple commits.\n3dfff1fe 8)\n3dfff1fe 9) A helpful guideline is \"**commit early, commit often**\".\n3dfff1fe 10)\n3dfff1fe 11) ## Commit early\n3dfff1fe 12)\n3dfff1fe 13) - Don't delay creating a commit because \"it's not ready yet\".\n3dfff1fe 14)\n3dfff1fe 15) - A commit doesn't have to be \"perfect\".\n3dfff1fe 16)\n3dfff1fe 17) ## Commit often\n3dfff1fe 18)\n3dfff1fe 19) - Small, focused commits are **extremely helpful** when trying to identify the cause of an unintended change in your code's behaviour or output.\n3dfff1fe 20)\n3dfff1fe 21) - There is no such thing as too many commits.\n
You can see that the first seven lines were last modified by commit 9be780b
(Briefly describe key version control concepts), while the rest of the file was last modified by commit 3dfff1f
(Add notes about committing early and often). So the text that we're interested in (line 9) was introduced by commit 3dfff1f
.
You can inspect this commit by running the following command:
git show 3dfff1f\n
Video demonstration "},{"location":"guides/using-git/where-did-this-problem-come-from/","title":"Where did this problem come from?","text":"Let's find the commit that created the file src/version-control/what-is-a-repository.md
. We could find this out using git log
, but the point here is to illustrate how to use a script to find the commit that causes any arbitrary change to our repository.
Once the commit has been found, you can inspect it (using git show <commit>
) to see all of the changes this commit introduced and the commit message that (hopefully) explains the reasons why this commit was made. This is one way in which your commit messages can act as a lab book.
Create a Python script called my_test.py
with the following contents:
#!/usr/bin/env python3\nfrom pathlib import Path\nimport sys\n\nexpected_file = Path('src') / 'version-control' / 'what-is-a-repository.md'\n\nif expected_file.exists():\n # This file is the \"new\" thing that we want to detect.\n sys.exit(1)\nelse:\n # The file does not exist, this commit is \"old\".\n sys.exit(0)\n
For reference, here is an equivalent R script:
#!/usr/bin/Rscript --vanilla\n\nexpected_file <- file.path('src', 'version-control', 'what-is-a-repository.md')\n\nif (file.exists(expected_file)) {\n # This file is the \"new\" thing that we want to detect.\n quit(status = 1)\n} else {\n # The file does not exist, this commit is \"old\".\n quit(status = 0)\n}\n
Select the commit range over which to search. We know that the file exists in the commit 3dfff1f
(Add notes about committing early and often), and it did not exist in the very first commit (5a19b02
).
Instruct Git to start searching with the following command:
git bisect start 3dfff1f 5a19b02\n
Note that we specify the newer commit first, and then the older commit.
Git will inform you about the search progress, and which commit is currently being investigated.
Bisecting: 7 revisions left to test after this (roughly 3 steps)\n[92f1375db21dd8a35ca141365a477b963dbbf6dc] Add CC-BY-SA license text and badge\n
Instruct Git to use the script my_test.py
to check each commit with the following command:
git bisect run ./my_test.py\n
It will continue to report the search progress and automatically identify the commit that we're looking for:
running './my_test.py'\nBisecting: 3 revisions left to test after this (roughly 2 steps)\n[9be780b8785d67ee191b2c0b113270059c9e0c3a] Briefly describe key version control concepts\nrunning './my_test.py'\nBisecting: 1 revision left to test after this (roughly 1 step)\n[055906f28da146a2d012b7c1c0e4707503ed1b11] Display example commit message as plain text\nrunning './my_test.py'\nBisecting: 0 revisions left to test after this (roughly 0 steps)\n[1251357ab5b41d511deb48cd5386cae37eec6751] Rename the \"What is a repository?\" source file\nrunning './my_test.py'\n1251357ab5b41d511deb48cd5386cae37eec6751 is the first bad commit\ncommit 1251357ab5b41d511deb48cd5386cae37eec6751\nAuthor: Rob Moss <robm.dev@gmail.com>\nDate: Sun Apr 17 21:41:43 2022 +1000\n\n Rename the \"What is a repository?\" source file\n\n The file name was missing the word \"a\" and did not match the title.\n\n src/SUMMARY.md | 2 +-\n src/version-control/what-is-a-repository.md | 18 ++++++++++++++++++\n src/version-control/what-is-repository.md | 18 ------------------\n 3 files changed, 19 insertions(+), 19 deletions(-)\n create mode 100644 src/version-control/what-is-a-repository.md\n delete mode 100644 src/version-control/what-is-repository.md\n
To quit the search and return to your current commit, run the following command:
git bisect reset\n
You can then inspect this commit by running the following command:
git show 1251357\n
This section provides a high-level introduction to the concepts that you should understand in order to make effective use of version control.
Info
Version control can turn your files into a lab book that captures the broader context of your research activities and that you can easily search and reproduce.
"},{"location":"guides/version-control/exercise-using-version-control/","title":"Exercise: using version control","text":"In this section we have introduced version control, and outlined how it can be useful for academic research activities, including:
Info
We'd now like you think about how version control might be useful to you and your research.
Have you experienced any issues or challenges in your career where version control would have been helpful? For example:
Have you ever looked at some of your older code and had difficulty understanding what it is doing, how it works, or why it was written?
Have you ever had difficulties identifying what code and/or data were used to generate a particular analysis or output?
Have you ever discovered a bug in your code and tried to identify when it was introduced, or what outputs it might have affected?
When collaborating on a research project, have you ever had challenges in making sure that everyone was working with the most recent files?
How can you use version control in your current research project(s)?
Do you have an existing project or piece of code that could benefit from being stored in a repository?
Have you recently written any code that could be recorded as one or more commits?
If so, what would you write for the commit messages?
Have you written some exploratory code or analysis that could be stored in a separate branch?
Having looked at the use of version control in the past and present, how would using version control benefit you?
"},{"location":"guides/version-control/how-do-I-write-a-commit-message/","title":"How do I write a commit message?","text":"Commit messages are shown as part of the repository history (e.g., when running git log
). Each message consists of a short one-line description, followed by as much or as little text as required.
You should treat these messages as entries in a log book. Explain what changes were made and why they were made. This can help collaborators understand what we have done, but more importantly is acts as a record for our future selves.
Info
Have you ever looked at code you wrote a long time ago and wondered what you were thinking?
A history of detailed commit messages should allow you to answer this question!
Remember that code is harder to read than it is to write (Joel Spolsky).
For example, rather than writing:
Added model
You could write something like:
Implemented the initial model
This model includes all of the core features that we need to fit the data, but there several other features that we intend to add:
- Parameter X is currently constant, but we may need to allow it to vary over time;
- Parameter Y should probably be a hyperparameter; and
- The population includes age-structured mixing, but we need to also include age-specific outcomes, even though there is very little data to suggest what the age effects might be.
"},{"location":"guides/version-control/what-is-a-branch/","title":"What is a branch?","text":"A branch allows you create a series of commits that are separate from the main history of your repository. They can be used for units of work that are too large to be a single commit.
Info
It is easy to switch between branches! You can work on multiple ideas or tasks in parallel.
Consider a repository with three commits: commit A, followed by commit B, followed by commit C:
At this point, you might consider two ways to implement a new model feature. One way to do this is to create a separate branch for each implementation:
You can work on each branch, and switch between them, in the same local repository.
If you decide that the first implementation (the green branch) is the best way to proceed, you can then merge this branch back into your main branch. This means that your main branch now contains six commits (A to F), and you can continue adding new commits to your main branch:
"},{"location":"guides/version-control/what-is-a-commit/","title":"What is a commit?","text":"A \"commit\" is a set of changes to one or more files in a repository. These changes can include:
Each commit also includes the date and time that it was created, the user that created it, and a commit message.
"},{"location":"guides/version-control/what-is-a-merge-conflict/","title":"What is a merge conflict?","text":"In What is a branch? we presented an example of successfully merging a branch into another. However, when we try to merge one branch into another, we may find that the two branches have conflicting changes. This is known as a merge conflict.
Consider two branches that make conflicting changes to the same line of a file:
Replace \"Second line\" with \"My new second line\":
First line\n-Second line\n+My new second line\n Third line\n
Replace \"Second line\" with \"A different second line\":
First line\n-Second line\n+A different second line\n Third line\n
There is no way to automatically reconcile these two branches, and we have to fix this conflict manually. This means that we need to decide what the true result should be, edit the file to resolve these conflicting changes, and commit our modifications.
"},{"location":"guides/version-control/what-is-a-repository/","title":"What is a repository?","text":"A repository records a set of files managed by a version control system, including the historical record of changes made to these files.
You can create as many repositories as you want. Each repository should be a single \"thing\", such as a research project or a journal article, and should be located in a separate directory.
You will generally have at least two copies of each repository:
A local repository on your computer; and
A remote repository on a service such as GitHub, or a University-provided platform (such as the University of Melbourne's GitLab instance).
You make changes in your local repository and \"push\" them to the remote repository. You can share this remote repository with your collaborators and supervisors, and they will be able to see all of the changes that you have pushed.
You can also allow collaborators to push their own changes to the remote repository, and then \"pull\" them into your local repository. This is one way in which you can use version control to work collaboratively on a project.
"},{"location":"guides/version-control/what-is-a-tag/","title":"What is a tag?","text":"A tag is a short, unique name that identifies a specific commit. You can use tags as bookmarks for interesting or important commits. Common uses of tags include:
Identifying manuscript revisions: draft-1
, submitted-version
, revision-1
, etc.
Identifying software package versions: v1.0
, v1.1
, v2.0
, etc.
Version control is a way of systematically recording changes to files (such as computer code and data files). This allows you to restore any previous version of a file. More importantly, this history of changes can be queried, and each set of changes can include additional information, such as who made the changes and an explanation of why the changes were made.
A core component of making great decisions is understanding the rationale behind previous decisions. If we don't understand how we got \"here\", we run the risk of making things much worse.
\u2014 Chesterton's Fence
For academic research activities that involve data analysis or simulation modelling, some key uses of version control are:
You can use it as a log book, and capture a detailed and permanent record of every step of your research. This is extremely helpful for people \u2014 including you! \u2014 who want to understand and make use of your work.
You can collaborate with others in a systematic way, ensuring that everyone has access to the most recent files and data, and review everyone's contributions.
You can inspect the changes made over a period of interest (e.g., \"What have I done in the last week?\").
You can identify when a specific change occurred, and what other changes were made at the same time (e.g., \"What changes did I make that affected this output figure?\").
In this book we will focus on the Git version control system, which is used by popular online platforms such as GitHub, GitLab, and Bitbucket.
"},{"location":"guides/version-control/what-should-I-commit/","title":"What should I commit?","text":"A commit should represent a unit of work.
If you've made changes that represent multiple units of work (e.g., changing how input data are processed, and adding a new model parameter) these should be saved as separate commits.
Try describing out loud the changes you have made, and if you find yourself saying something like \"I did X and Y and Z\", then the changes should probably divided into multiple commits.
A helpful guideline is \"commit early, commit often\".
"},{"location":"guides/version-control/what-should-I-commit/#commit-early","title":"Commit early","text":"Don't delay creating a commit because \"it's not ready yet\".
A commit doesn't have to be \"perfect\".
Small, focused commits are extremely helpful when trying to identify the cause of an unintended change in your code's behaviour or output.
There is no such thing as too many commits.
For computational research, code is an important scientific artefact for the author, for colleagues and collaborators, and for the scientific community. It is the ultimate form of expressing what you did and how you did it. With good version control and documentation practices, it can also capture when and why you made important decisions.
Tip
[W]e want to establish the idea that a computer language is not just a way of getting a computer to perform operations but rather that it is a novel formal medium for expressing ideas about methodology. Thus, programs must be written for people to read, and only incidentally for machines to execute.
\u2014 Structure and Interpretation of Computer Programs. Abelson, Sussman, and Sussman, 1984.
"},{"location":"guides/writing-code/behave-nicely/","title":"Behave nicely","text":"Would you feel comfortable running someone else's code if you thought it might affect your other files, applications, settings, or do something else that's unexpected?
Tip
Your code should be encapsulated: it should assume as little as possible about the computer on which it is running, and it shouldn't mess with the user's environment.
Tip
Your code should follow the principal of least surprise: behave in a way that most users will expect it to behave, and not astonish or surprise them.
"},{"location":"guides/writing-code/behave-nicely/#a-cake-analogy","title":"A cake analogy","text":"Suppose you have two colleagues who regularly bake cakes, and you decide you'd like one of them to bake you a chocolate cake.
A nice colleague:
A messy colleague:
Avoid modifying files outside of the project directory!
Avoid using hard-coded absolute paths, such as C:\\Users\\My Name\\Some Project\\...
or /Users/My Name/Some other directory
. These make it harder for other people to use the code, or to run the code on high-performance computing platforms.
Prefer using paths that are relative to the root directory of your project, such as input-data/case-data/cases-for-2023.csv
. If you're using R, the here package is extremely helpful.
Warn the user before running tasks that take a long time to complete.
Notify the user before downloading large files.
A \"linter\" is a tool that checks your code for syntax errors, possible mistakes, inconsistent formatting, and other potential issues.
We strongly recommend using an editor that displays linter warnings as you write your code. Having instant feedback allows you to rapidly resolve many common issues and substantially improve your code.
We list here some of the most commonly used linters:
R: lintr
Python: ruff
Julia: Lint.jl
Think about how to cleanly structure your code. Take a similar approach to how we write papers and grants.
Break the overall problem into pieces, and then decide how to structure each piece in turn.
Divide your code into functions that each do one \"thing\", and group related functions into separate files or modules.
It can sometimes help to think about how you want the final code to look, and then design the functions and components that are needed.
Avoid global variables, aim to pass everything as function arguments. This makes the code more robust and easier to run.
Avoid passing lots of individual parameters as separate arguments, this is prone to error \u2014 you might not pass them in the correct order. Instead, collect the parameters into a single structure (e.g, a Python dictionary, an R named list).
Avoid making multiple copies of a model if you want to change some aspect of its behaviour. Instead, add a new model parameter that enables/disables this new behaviour. This allows you to use the same code to run the older and newer versions of the model.
Try to collect common or related tasks into a single script, and allow the user to select which task(s) to run, rather than creating many scripts that perform very similar tasks.
Write test cases to check key model properties.
You want to identify problems and mistakes as soon as possible!
Thinking about how to make your code testable can help you improve its structure!
Well-written tests can also demonstrate how to use your code!
Divide your code into modules, each of which does one thing (\"high cohesion\") and depends as little as possible on other pieces (\"low coupling\").
"},{"location":"guides/writing-code/cohesion-coupling/#common-project-components","title":"Common project components","text":"For example, an infectious diseases modelling project might often be divided into some of the following components:
The model parameters \u2014 what are their values or prior distributions?
The initial model state \u2014 how is this created from the model parameters?
The model equations or update rules \u2014 how does the model evolve over time?
Summary statistics \u2014 what do you want to record for each simulation? This might be the entire state history, a subset of the history, some aggregate statistics, or any combination of these things.
The input data (if any) \u2014 these may be case data, serological data, within-host specimen counts, etc.
The relationship between data and the model state (\"observation model\").
Simulated data generated from a model simulation.
As much as possible, each of these components (where relevant to your project) should be represented as a separate piece of code.
"},{"location":"guides/writing-code/cohesion-coupling/#separating-the-what-from-the-how","title":"Separating the \"what\" from the \"how\"","text":"Dividing your code into separate components is especially important if you want to use a model for multiple purposes, such as:
Tip
In particular, keep the following aspects of your project separate:
What to do: fitting to different data sets, exploring different scenarios, performing a sensitivity analysis, etc; and
How to do it: the model implementation.
If you want to explore a range of model scenarios, for example, define the parameter values (or sampling distributions) for each scenario in a separate input file. Then write a script that takes an input file name as an argument, reads the parameter values, and uses these values to run the model simulations.
This makes it extremely simple to define and run new scenarios without modifying your code.
"},{"location":"guides/writing-code/cohesion-coupling/#interactions-between-components","title":"Interactions between components","text":"Choosing how your components interact (e.g., by calling functions or passing data) is just as important as deciding how to divide your code into components.
Here are some key recommendations from Object-Oriented Software Construction (2nd ed):
Small interfaces: if two modules communicate, they should exchange as little information as possible.
Explicit interfaces: if two modules communicate, it should be obvious from the code in one or both of these modules.
Self documentation: strive to make all information about a module part of the module itself.
For languages such as R, Python, and Julia, it is generally a good idea to write your code as a package/library. This can make it easier to install and run your code on a new computer, on a high-performance computing platform, and for others to use on their own computers.
Info
This is a simple process and entirely separate from publishing your package or making it publicly available.
It also means you can avoid using source()
in R, or adding directories to sys.path
in Python.
To create a package you need to provide some necessary information, such as a package name, and the list of the packages that your code depends on (\"dependencies\"). You can then use packaging tools to verify that you've correctly identified these dependencies and that your package can be successfully installed and used!
This is an important step towards ensuring your work is reproducible.
There are some great online resources that can help you get started. We list here some widely-recommended resources for specific languages.
"},{"location":"guides/writing-code/create-packages/#writing-r-packages","title":"Writing R packages","text":"For R, see R Packages (2nd ed) and the devtools package.
Other useful references include:
Info
rOpenSci offers peer review of statistical software.
"},{"location":"guides/writing-code/create-packages/#writing-python-packages","title":"Writing Python packages","text":"The Python Packaging User Guide provides a tutorial on Packaging Python Projects.
Other useful references include:
The pyOpenSci project also provide a Python Packaging Guide. This includes information about code style, formatting, and linters.
This example Python project demonstrates one way of structuring a Python project as a package.
Info
pyOpenSci offers peer review of scientific software
"},{"location":"guides/writing-code/create-packages/#writing-julia-packages","title":"Writing Julia Packages","text":"The Julia's package manager documentation provides a guide to Creating Packages
"},{"location":"guides/writing-code/document-your-code/","title":"Document your code","text":"Writing clear, well-structured code, can make it easier for someone to understand what your code does. You might think that this means your code is so clear and obvious that it needs no further explanation.
But this is not true! There is always a role for writing comments and documentation. By itself, your code cannot always explain:
What goal you are trying to achieve;
How you are achieving this goal; and
Why you've chosen this approach.
Question
What can you do to make your code more easily understandable?
"},{"location":"guides/writing-code/document-your-code/#naming","title":"Naming","text":"Use good names for functions, parameters, and variables. This can be deceptively hard.
Quote
There are only two hard things in Computer Science: cache invalidation and naming things.
\u2014 Phil Karlton
"},{"location":"guides/writing-code/document-your-code/#explaining","title":"Explaining","text":"Have you explained the intention of your code?
Tip
Good comments don't say what the code does; instead, they explain why the code does what it does.
For each function, write a comment that explains what the function does, describes the purpose of each parameter, and describes what values the function returns (if any).
"},{"location":"guides/writing-code/document-your-code/#documenting","title":"Documenting","text":"Many programming languages support \"docstrings\". These are usually comments with additional structure and formatting, and can be used to automatically generate documentation:
R: roxygen2
Python: there are several formats
Julia: Writing Documentation
See the CodeRefinery In-code documentation lesson for some good examples of docstrings.
"},{"location":"guides/writing-code/document-your-code/#commenting-out-code","title":"Commenting out code","text":"Avoid commenting out code. If it's no longer useful, delete it and save this as a commit! Make sure you write a helpful commit message. You can always recover the deleted code if you need it later.
"},{"location":"guides/writing-code/exercise-seek-feedback/","title":"Exercise: seek feedback","text":"Question
One goal to keep in mind is to ensure your work is conceptually accessible: how readily could someone else (or even yourself, after a period of absence) understand your code?
Question
Have you ever looked at someone else's code and found it hard to read because they formatted it differently to your code?
Using a consistent code style can help make your code more legible and accessible to others, in much the same way that standard use of punctuation and spacing makes written text easier to read.
Tip
Good coding style is like using correct punctuation: you can manage without it, butitsuremakesthingseasiertoread.
\u2014 Hadley Wickham, the tidyverse style guide
We strongly recommend using an editor that can automatically format your code whenever you save. This allows you to completely forget about formatting and focus on the content.
We list here some of the most commonly used style guides and code formatters:
Language Style guide(s) Formatter R tidyverse styler Python PEP 8 and The Hitchhiker's Style Guide black Julia style guide Lint.jl"},{"location":"guides/writing-code/how-we-learn-to-write-code/","title":"How we learn to write code","text":"Question
How have you learned to write code? Were you given any formal training?
Unless you studied Software Engineering, you may never have had any formal training. And that's okay! Nobody writes perfect code.
There are various resources available (including this book) that can help you to improve your coding skills. But the most effective way to improve is to write code and get feedback.
Tip
You can practice shooting eight hours a day, but if your technique is wrong, then all you become is very good at shooting the wrong way.
\u2014 Michael Jordan
"},{"location":"guides/writing-code/how-we-learn-to-write-code/#how-we-learn-to-write-papers","title":"How we learn to write papers","text":"Throughout our research careers, we are continually learning and developing our ability to write scientific papers. One of the main ways that we develop this ability is to seek feedback early and often, by circulating drafts to co-authors, supervisors, and trusted colleagues.
This feedback not only helps us improve the paper that we're currently working on, but also improves our ability to write papers in the future.
We gradually learn how to express ourselves clearly at multiple levels:
Writing individual sentences that clearly convey a single thought or observation;
Constructing paragraphs that span a single topic or idea;
Structuring an entire paper so that the reader can easily navigate it.
Many of us learn to write code as a by-product of our chosen research area, and may not have any formal computer programming training. However, while we may make our finished code available as a support material for our published papers, we don't typically show our code to our co-authors.
Info
While there are many reasons why we are reluctant to share our code, perhaps the biggest factor is a sense of shame. We may feel that our code is \"bad\" \u2014 too bad to share with others! \u2014 and that if we've ever made a mistake in our code, we're the only person who has ever done so.
This is simply untrue!
"},{"location":"guides/writing-code/how-we-learn-to-write-code/#how-we-should-learn-to-code","title":"How we should learn to code","text":"We should treat writing code the same way that we treat writing papers, grant applications, and fellowship applications: seek feedback early, and seek feedback often.
Question
Wouldn't you prefer that the first person who looks at your code is a trusted colleague, rather than a random person who has read your paper and now wants to see how the code works?
Peer code review offers a structured way to:
Discuss and critique a person's work in a kind and supportive manner;
Praise good work;
Identify where code is well-structured and clear, and where it could be improved; and
Share relevant knowledge and expertise.
Similar to writing papers, we should seek feedback at multiple levels:
Are individual lines of code clear and correct?
Are strongly-related lines of code grouped into functions that each do a single thing?
Are functions grouped into modules that focus on specific aspects or features?
Can the reader easily navigate the code?
One of our goals for 2024 is to develop orientation materials for new students, postdocs, etc. There was broad interest in having a checklist, and example workflows for people to follow \u2014 particularly for projects that involve some form of code \"hand over\", to ensure that the recipients experience few problems in running the code themselves.
How to contribute
To suggest a new topic:
Use the search box (top of the page) and check if the topic already exists;
If the topic does not exist, submit a \"New Topic\" issue.
To suggest a useful resource: submit a \"Useful Resource\" issue.
To provide feedback about existing content: submit a \"Feedback\" issue.
To contribute new content:
Use the search box (top of the page) and check if similar content already exists;
If there is no similar content, please create a fork of this repository, add your contributions, and create a pull request.
See our How to contribute page for more details, such as formatting guidelines.
Current issues for the orientation guide are listed here.
Suggested topics included:
Note
In addition to the topical guides, the Useful resources section includes: