Merge pull request #2 from csiro-data-school/spr-0124

Updates to chapters 1-4 ahead of CSIRO Data School Jan '24
csiro-data-school · Jan 28, 2024 · a36ff42
2 parents c672a88 + b8211fa
commit a36ff42
Showing 6 changed files with 111 additions and 57 deletions.
diff --git a/episodes/02-data_management.md b/episodes/02-data_management.md
@@ -104,19 +104,16 @@ that case, record the exact procedure used to obtain the raw data,
 as well as any other pertinent information, such as an official
 version number or the date of download.
 
-If external hard drives
-are used, store them off-site of the original location. Universities
-often have their own data storage solutions, so it is worthwhile to
-consult with your local Information Technology (IT) group or
-library. Alternatively cloud computing resources, like
-Amazon Simple Storage Service (Amazon S3), Google Cloud
-Storage or [Azure](https://azure.microsoft.com/en-us/services/storage/) are
-reasonably priced and reliable. For large data sets, where storage
-and transfer can be expensive and time-consuming, you may need to
-use incremental backup or specialized storage systems, and people in
-your local IT group or library can often provide advice and
-assistance on options at your university or organization as well.
-
+In CSIRO, both raw data and final project data can (and should) be backed up on the 
+[Data Access Portal](https://data.csiro.au/) (it can host internal-only data copies 
+as well as publicly published data). We also have project-allocated Bowen storage 
+and HPC-linked high-volume storage paths that IM&T can provide access to, but check
+whether any automated backup schedules are in place for them.  
+
+Also, remember not to place data in Git repositories. These are for code, scripts,
+notes, documentation, etc., but not data. Good data curation practices are different
+to good code curation practices.  
+
 ## Working with sensitive data
 
 Identify whether your project will work with sensitive data - by which we might mean:
@@ -128,16 +125,12 @@ Identify whether your project will work with sensitive data - by which we might
 It is important to understand the restrictions which may apply when working with sensitive data, and also ensure that your project complies with any applicable laws relating to storage, use and sharing of sensitive data (for example, laws like the General Data Protection Regulation, known as the GDPR).
 These laws vary between countries and may affect whether you can share information between collaborators in different countries.
 
-## Create the data you wish to see in the world
-
-:::::::::::::::::::::::::::::::::::::::  challenge
-
-## Discussion (2 minutes)
-
-Which file formats do you store your data in? Enter your answers in the collaborative document.
-
+You should have completed some compulsory online training on working with sensitive data within
+CSIRO.  
+The [Australian Research Data Commons \(ARDC\)](https://ardc.edu.au/resource-hub/working-with-sensitive-data/)
+provides some additional guides for working with sensitive data within Australia.  
 
-::::::::::::::::::::::::::::::::::::::::::::::::::
+## Create the data you wish to see in the world
 
 *Filenames*: Store especially useful metadata as part of the
 filename itself, while keeping the filename regular enough for easy
@@ -177,7 +170,7 @@ transformations that we recommend at the beginning of analysis:
 
 ## Discussion (2 minutes)
 
-Which of the table layouts is analysis friendly? Discuss. Enter your answers in the collaborative document.
+Which of the table layouts is analysis friendly? Discuss. 
 ![](fig/wilson-tidy-data.png){alt="Two tables of data appear side-by-side. The table on the left has columns named site, 1999, and 2000. The table on the right has columns named site, year, and cases."}
 
 
@@ -230,7 +223,7 @@ chosen as a set of boundary coordinates.
 
 ## How, when and why do you document?
 
-As much as possible, always and to help you future self.
+As much as possible, always and to **help your future self**.
 
 
 ::::::::::::::::::::::::::::::::::::::::::::::::::
@@ -259,24 +252,26 @@ Which of the following places would be good places to share your data?
 
 - Personal/lab web-site
 - GitHub
-- General repo (i.e.: Zenodo, Data Dryad, etc.)
-- Community specific repo (i.e.: ArrayExpress, SRA, EGA, PRIDE, etc.)
+- General data repo (e.g. Zenodo, Data Dryad, etc.)
+- Community specific repo (e.g. ArrayExpress, NCBI, SRA, EGA, PRIDE, etc.)
+- DAP
 
 :::::::::::::::  solution
 
 ## Solution
 
-- Personal/lab web-site: this is not the best place to store your data long-term. These websites are not hosted long term. You can have a link to the repo, though.
-- GitHub: in itself it is not proper for sharing your data as it can be modified. However, a snapshot of a Github repository can be stored in Zenodo and be issued a DOI.
-- General repo (i.e.: Zenodo, Data Dryad, etc.): good option to deposit data that does not fit in a specific repository. Best if the service is non-commerical, has long-termdata archival and issues DOIs, such as Zenodo.
-- Community specific repo (i.e.: ArrayExpress, SRA, EGA, PRIDE, etc.): best option to share your data, if your research community has come up with a sustainable long-term repository.
+- Personal/lab web-site: not the best place to store your data long-term. These websites are not hosted long term. You can have a link to the repo, though.
+- GitHub: in itself it is not suitable for sharing your data, as the data can be modified. A snapshot of a GitHub repository can be stored in a service like Zenodo or the DAP and be issued a DOI.
+- General data repo (e.g. Zenodo, Data Dryad, etc.): good option to deposit data that does not fit in a specific repository. Best if the service is non-commercial, has long-term data archival and issues DOIs, such as Zenodo. But DAP is preferred for CSIRO employees.
+- Community specific repo (e.g. ArrayExpress, NCBI, SRA, EGA, PRIDE, etc.): good option, if your research community has come up with a sustainable long-term repository for a certain data type.
+- DAP: generally the best option for any data sharing for CSIRO employees. 
 
 :::::::::::::::::::::::::
 
 ::::::::::::::::::::::::::::::::::::::::::::::::::
 
 Your data is as much a product of your research as the papers you write, and just as likely to be useful to others (if not more so).
-Sites such as [Dryad](https://datadryad.org) and [Zenodo](https://zenodo.org) allow others to find your work, use it, and cite it; we discuss licensing in the episode on collaboration [04-collaboration].
+Services such as the [DAP](https://data.csiro.au/) can allow others to find your work, use it, and cite it; we discuss licensing in the episode on collaboration [04-collaboration].
 Follow your research community's standards for how to provide metadata.
 Note that there are two types of metadata: metadata about the dataset as a whole and metadata about the content within the dataset.
 If the audience is humans, write the metadata (the README file) for humans.
@@ -288,7 +283,8 @@ If the audience includes automatic metadata harvesters, fill out the formal meta
 
 - A digital object identifier is a persistent identifier or handle used to identify objects uniquely.
 - Data with a persistent DOI can be found even when your lab website dies.
-- doi-issuing repositories include: zenodo, figshare, dryad.
+- DOI-issuing repositories include Zenodo, Figshare, Dryad and the DAP.
+- e.g. [https://doi.org/10.25919/t1ad-8k76](https://doi.org/10.25919/t1ad-8k76)
 
 ::::::::::::::::::::::::::::::::::::::::::::::::::
 
@@ -300,6 +296,7 @@ If the audience includes automatic metadata harvesters, fill out the formal meta
 - Zenodo ([http://zenodo.org](https://zenodo.org)): A repository service that enables researchers, scientists, projects, and institutions to share and showcase multidisciplinary research results (data and publications)
 - Dryad ([http://datadryad.org](https://datadryad.org)): A repository that aims to make data archiving as simple and as rewarding as possible through a suite of services not necessarily provided by publishers or institutional websites.
 - Dataverse ([http://thedata.org](https://thedata.org)): A repository for research data that takes care of long-term preservation and good archival practices, while researchers can share, keep control of, and get recognition for their data.
+- The DAP ([https://data.csiro.au/](https://data.csiro.au/)): CSIRO's own data repository.
 
 ::::::::::::::::::::::::::::::::::::::::::::::::::
 
@@ -318,7 +315,7 @@ Many funders provide basic templates for writing a DMP, along with guidelines on
 
 ## Discussion (2 minutes)
 
-Aside from being a requirement, there are many benefits of writing a DMP to researchers. What sort of benefits do you think there are? Enter your answers in the collaborative document.
+Aside from being a requirement, there are many benefits of writing a DMP to researchers. What sort of benefits do you think there are? 
 
 :::::::::::::::  solution
 
@@ -339,6 +336,8 @@ Often research institutions provide support for DMPs, e.g. through library servi
 
 More resources on data management plans are available at [DMP online](https://dmponline.dcc.ac.uk).
 
+CSIRO's [Research Data Planner](https://rdp.csiro.au/) can help you prepare a data management plan. It will be covered in detail next week.  
+
 ::::::::::::::::::::::::::::::::::::::  discussion
 
 ## What's your next step in data management?

diff --git a/episodes/03-software.md b/episodes/03-software.md
@@ -53,7 +53,7 @@ There are extended discussions about research software at the [Software Sustaina
 
 **Discussion**
 
-What can go wrong with writing research code?
+What can go wrong when trying to reuse research code?
 
 :::::::::::::::  solution
 
@@ -239,23 +239,20 @@ count_fruit_on_island = function(fruit type, island)
     return total fruit
 ```
 
-Write the commands to call this function to count how many coconuts there are on Sam's island, how many cherries
-there are on Sam's island, and how many cherries there are on Charlie's island.
+Write a pseudocode command to call this function to count how many coconuts there are on Sam's island.
 
 Write a pseudocode for loop like the one above that uses this function to count all the cherries on every island.
 
 :::::::::::::::  solution
 
 ## Solution
 
+To count Sam's island's coconuts:
 ```source
 sams coconuts = count_fruit_on_island(coconuts, Sam's island)
-sams cherries = count_fruit_on_island(cherries, Sam's island)
-charlies cherries = count_fruit_on_island(cherries, Charlie's island)
 ```
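For readers who want to see the call above as runnable code, here is a minimal Python sketch. The `ISLANDS` data, its layout (one dictionary of fruit counts per tree) and the numbers in it are hypothetical and only stand in for the islands in the exercise; the function body is one plausible reading of the pseudocode, not the lesson's own implementation.

```python
# Hypothetical data (not from the lesson): fruit counts per tree on each island.
ISLANDS = {
    "Sam's island": [{"coconuts": 3, "cherries": 10}, {"coconuts": 5}],
    "Charlie's island": [{"cherries": 7}, {"cherries": 2, "coconuts": 1}],
}


def count_fruit_on_island(fruit_type, island):
    """Add up one type of fruit across every tree on one island."""
    total_fruit = 0
    for tree in ISLANDS[island]:
        # A tree without this fruit type contributes zero.
        total_fruit += tree.get(fruit_type, 0)
    return total_fruit


sams_coconuts = count_fruit_on_island("coconuts", "Sam's island")
print(sams_coconuts)
```

The arguments are passed as quoted strings because, unlike pseudocode, Python needs names with spaces and apostrophes to be written as text values.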
 
 To count all the cherries on every island:
-
 ```source
 total cherries = 0
 for every island
@@ -328,13 +325,13 @@ What are the most meaningful names for `functionName` and `variableName`? Choose
 
 1. processFunction - incorrect, too vague
 2. computeCubesOfThird - incorrect, doesn't imply every third in sequence
-3. cubeEveryThirdNumberInASequence - incorrect, too long
+3. cubeEveryThirdNumberInASequence - maybe, but too long
 4. **cubeEachThird - correct, short and includes information on the data and calculation performed**
-5. 3rdCubed - incorrect, bad practice to put a number at the beginning of a function name (and not allowed by some programming languages)
+5. 3rdCubed - incorrect, bad practice to put a number at the beginning of a function name (not allowed by some programming languages)
 
 `variableName`
 
-1. arrayOfNumbersToBeCubed - incorrect, too long
+1. arrayOfNumbersToBeCubed - maybe, but too long
 2. input - incorrect, too vague
 3. **numericSequence - correct, short and included information about the type of input**
 4. S - incorrect, too vague
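To illustrate how the preferred names read in practice, here is a small Python sketch. It assumes the function returns a copy of the input in which every third item (the 3rd, 6th, 9th, ...) is cubed; the lesson's own pseudocode is not shown in this hunk, so that behaviour is a guess. The names are written in snake_case, the usual Python convention, as `cube_each_third` and `numeric_sequence`.

```python
def cube_each_third(numeric_sequence):
    """Return a new list in which every third item is cubed."""
    return [
        value ** 3 if position % 3 == 0 else value
        # Count positions from 1 so that "third" means the 3rd, 6th, 9th item.
        for position, value in enumerate(numeric_sequence, start=1)
    ]


print(cube_each_third([1, 2, 3, 4, 5, 6]))  # prints [1, 2, 27, 4, 5, 216]
```

With these names, the call site already states what is computed and on what kind of input, which is the point of the exercise.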

diff --git a/episodes/04-collaboration.md b/episodes/04-collaboration.md
@@ -152,30 +152,66 @@ This Software Project README:
 
 ::::::::::::::::::::::::::::::::::::::::::::::::::
 
-## Create a shared "to-do" list
-
-This can be a plain text
-file called something like `notes.txt` or `todo.txt`, or you can use
-sites such as GitHub or Bitbucket to create a new *issue* for each
-to-do item. (You can even add labels such as "low hanging fruit" to
-point newcomers at issues that are good starting points.) Whatever
-you choose, describe the items clearly so that they make sense to
-newcomers.
-
 ## Decide on communication strategies
 
-Make explicit
-decisions about (and publicize where appropriate) how members of the
+Make explicit decisions about (and publicize where appropriate) how members of the
 project will communicate with each other and with external users /
 collaborators. This includes the location and technology for email
 lists, chat channels, voice / video conferencing, documentation, and
 meeting notes, as well as which of these channels will be public or
 private.
 
+Supported tools within CSIRO include [Confluence](https://confluence.csiro.au/),
+for wiki-style shared documentation, and Microsoft Teams, for online discussions,
+video conferencing, file sharing and more.
+
+Not supported, but worth knowing about, is [Miro](https://miro.com/), which lets you create 
+a shared online whiteboard where multiple people can build up notes, diagrams,
+graphs, etc., at the same time, with quite powerful tools.  
+
 ## Collaborations with sensitive data
 
 If you determine that your project will include work with sensitive data, it is important to agree with collaborators on how and where the data will be stored, as well as what the mechanisms for sharing the data will be and who is ultimately responsible for ensuring these are followed.
 
+## Create a shared "to-do" list
+
+Organising a structured to-do list of outstanding tasks, together with an overall project work
+plan, can be really useful whether you are collaborating with others or just with your future self.
+There are many options and tools available for doing this.
+
+- If your project is centered around a Git-tracked repository, it could be as simple
+as updating a text file like `notes.txt` or `todo.txt`.
+
+- Or make use of the ability to track "issues" on 
+[BitBucket](https://support.atlassian.com/bitbucket-cloud/docs/understand-bitbucket-issues/)
+or [GitHub](https://docs.github.com/en/issues/tracking-your-work-with-issues/about-issues). 
+The "issues" feature on online Git repositories allows you or others to describe work that needs
+to be done (often used on public repositories for reporting bugs in software), 
+create and follow discussions about the issues/tasks, and link the closing of the issues 
+to commits/pull-requests.
+E.g.: [Issues for original version of this lesson](https://github.com/carpentries-lab/good-enough-practices/issues)
+
+- Microsoft Teams 'Tasks'. Teams now lets you 
+[add a 'Tasks' app/tab to a team space](https://support.microsoft.com/en-au/office/use-the-tasks-app-in-teams-e32639f3-2e07-4b62-9a8c-fd706c12c070), 
+which then lets you create a to-do list, assign tasks to people, and track and view the
+tasks in various ways.
+![](fig/ms-tasks-list-view.png){alt="An example of Teams Tasks list view"}
+
+- [Jira](https://jira.csiro.au/) is another tool supported and deployed in CSIRO. Developed by
+Australian software company [Atlassian](https://www.atlassian.com/software/jira), it allows 
+tracking of to-do tasks/issues and sub-tasks, lets you assign tasks to people, and lets 
+you track and view tasks in the context of workflows, timelines, and "board" visualisations,
+such as the "Kanban board". Jira can directly integrate with both BitBucket
+and Confluence, with Jira tasks able to be linked, referenced and tracked in each.
+Jira can be a very powerful tool if fully embraced, but can also be a bit clunky when starting out.
+
+![](fig/jira-kanban-example.png){alt="An example of a Jira Kanban board"}
+A Kanban board example, from [Jira's website](https://www.atlassian.com/software/jira/templates/scrum).  
+
+![](fig/jira-backlog-example.png){alt="An example of a Jira Backlog"}
+A "backlog" view example, from [Jira's website](https://www.atlassian.com/software/jira/templates/scrum).  
+
+
 ## Make the license explicit
 
 :::::::::::::::::::::::::::::::::::::::::  callout
@@ -197,8 +233,10 @@ explicit license does not mean there isn't one; rather, it implies
 the author is keeping all rights and others are not allowed to
 re-use or modify the material.
 A project that consists of data and text may benefit from a different license to a project consisting primarily of code.
+
+**IM&T can help with advising on suitable licenses.**
 
-We recommend Creative Commons licenses for data and text, either
+The original authors of this lesson recommend Creative Commons licenses for data and text, either
 [CC-0](https://creativecommons.org/share-your-work/public-domain/cc0/) (the "No Rights Reserved"
 license) or [CC-BY](https://wellcome.org/grant-funding/guidance/creative-commons-attribution-licence-cc) (the "Attribution"
 license, which permits sharing and reuse but requires people to give
@@ -209,7 +247,7 @@ A useful resource to compare different licenses is available at [tldrlegal](http
 
 > **What Not To Do**
 > 
-> We recommend *against* the "no commercial use" variations of the
+> We (the original authors) recommend *against* the "no commercial use" variations of the
 > Creative Commons licenses because they may impede some forms of
 > re-use. For example, if a researcher in a developing country is
 > being paid by her government to compile a public health report,
@@ -221,7 +259,7 @@ A useful resource to compare different licenses is available at [tldrlegal](http
 
 ## Make the project citable
 
-A `CITATION` file describes how to cite this
+A `CITATION` file describes how to cite your
 project as a whole, and where to find (and how to cite) any data
 sets, code, figures, and other artifacts that have their own DOIs.
 The example below shows the `CITATION` file for the
@@ -235,8 +273,28 @@ Please cite this work as:
 Morris, B.D. and E.P. White. 2013. "The EcoData Retriever:
 improving access to existing ecological data." PLOS ONE 8:e65848.
 http://doi.org/doi:10.1371/journal.pone.0065848
+```  
+
+More recently, a standard for citation files has been developed: the Citation File Format (CFF).
+Usually saved in a `CITATION.cff` file, this format was proposed 
+specifically to hold the expected citation information as a standardised set 
+that is both human and machine readable. E.g.:
+```
+cff-version: 1.2.0
+message: "If you use this software, please cite it as below."
+authors:
+  - family-names: Druskat
+    given-names: Stephan
+    orcid: https://orcid.org/1234-5678-9101-1121
+title: "My Research Software"
+version: 2.0.4
+doi: 10.5281/zenodo.1234
+date-released: 2021-08-11
 ```
 
+More information on CFF is available here: [citation-file-format.github.io](https://citation-file-format.github.io/)
+
+
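As a rough illustration of how a `CITATION.cff` file like the one above could be produced from a script, here is a minimal Python sketch using only the standard library. The helper name `format_cff` is hypothetical, and it writes only the `cff-version`, `message`, `authors` and `title` keys that appear in the example; a real project would more likely use a YAML library or dedicated CFF tooling, and should check the result against the CFF documentation.

```python
import json


def format_cff(title, authors,
               message="If you use this software, please cite it as below."):
    """Return the text of a minimal CITATION.cff file.

    `authors` is a list of (family_names, given_names) tuples.
    """
    # json.dumps emits double-quoted strings, which are also valid YAML scalars,
    # so values containing colons or quotes are written safely.
    lines = [
        "cff-version: 1.2.0",
        f"message: {json.dumps(message)}",
        "authors:",
    ]
    for family_names, given_names in authors:
        lines.append(f"  - family-names: {json.dumps(family_names)}")
        lines.append(f"    given-names: {json.dumps(given_names)}")
    lines.append(f"title: {json.dumps(title)}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(format_cff("My Research Software", [("Druskat", "Stephan")]), end="")
```

Writing the returned text to a file named `CITATION.cff` in the repository root is the usual placement; optional keys such as `version`, `doi` and `date-released` could be added in the same way.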
 ## Recommended resources
 
 - [The Turing Way Guide for Collaboration](https://the-turing-way.netlify.app/collaboration/collaboration.html)

diff --git a/episodes/fig/jira-backlog-example.png b/episodes/fig/jira-backlog-example.png
diff --git a/episodes/fig/jira-kanban-example.png b/episodes/fig/jira-kanban-example.png
diff --git a/episodes/fig/ms-tasks-list-view.png b/episodes/fig/ms-tasks-list-view.png