Merge pull request #2 from csiro-data-school/spr-0124
Updates to chapters 1-4 ahead of CSIRO Data School Jan '24
spriggsy83 authored Jan 28, 2024
2 parents c672a88 + b8211fa commit a36ff42
Showing 6 changed files with 111 additions and 57 deletions.
65 changes: 32 additions & 33 deletions episodes/02-data_management.md
@@ -104,19 +104,16 @@ that case, record the exact procedure used to obtain the raw data,
as well as any other pertinent information, such as an official
version number or the date of download.
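One way to record this information is to write it to a small file stored next to the raw data. The sketch below is only an illustration: the function names `describe_download` and `write_provenance`, the field names, and the JSON layout are assumptions made for this example, not part of the lesson or of any CSIRO tooling.

```python
import hashlib
import json
from datetime import date
from pathlib import Path


def describe_download(data_file, source_url, version=None):
    """Return a provenance record (hypothetical layout) for one raw data file."""
    data_file = Path(data_file)
    # A SHA-256 checksum lets you confirm later that the raw file is unchanged.
    checksum = hashlib.sha256(data_file.read_bytes()).hexdigest()
    return {
        "file": data_file.name,
        "source_url": source_url,
        "version": version,
        "date_downloaded": date.today().isoformat(),
        "sha256": checksum,
    }


def write_provenance(record, out_path):
    """Save the record as human-readable JSON alongside the raw data."""
    Path(out_path).write_text(json.dumps(record, indent=2) + "\n")
```

Calling `describe_download("raw.csv", "https://example.org/raw.csv", version="1.0")` and passing the result to `write_provenance` would leave a short, readable note of where the file came from and when.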

If external hard drives
are used, store them off-site of the original location. Universities
often have their own data storage solutions, so it is worthwhile to
consult with your local Information Technology (IT) group or
library. Alternatively cloud computing resources, like
Amazon Simple Storage Service (Amazon S3), Google Cloud
Storage or [Azure](https://azure.microsoft.com/en-us/services/storage/) are
reasonably priced and reliable. For large data sets, where storage
and transfer can be expensive and time-consuming, you may need to
use incremental backup or specialized storage systems, and people in
your local IT group or library can often provide advice and
assistance on options at your university or organization as well.

In CSIRO, both raw data and final project data can (and should) be backed up on the
[Data Access Portal](https://data.csiro.au/) (it can host internal-only data copies
as well as publicly published data). We also have project-allocated Bowen storage
and HPC-linked high-volume storage paths that IM&T can provide access to, but check
whether any automated backup schedules are in place.

Also, remember not to place data in Git repositories. These are for code, scripts,
notes, documentation, etc., but not data. Good data curation practices are different
to good code curation practices.

## Working with sensitive data

Identify whether your project will work with sensitive data - by which we might mean:
@@ -128,16 +125,12 @@ Identify whether your project will work with sensitive data - by which we might
It is important to understand the restrictions which may apply when working with sensitive data, and also ensure that your project complies with any applicable laws relating to storage, use and sharing of sensitive data (for example, laws like the General Data Protection Regulation, known as the GDPR).
These laws vary between countries and may affect whether you can share information between collaborators in different countries.

## Create the data you wish to see in the world

::::::::::::::::::::::::::::::::::::::: challenge

## Discussion (2 minutes)

Which file formats do you store your data in? Enter your answers in the collaborative document.

You should have completed some compulsory online training on working with sensitive data within
CSIRO.
The [Australian Research Data Commons \(ARDC\)](https://ardc.edu.au/resource-hub/working-with-sensitive-data/)
provides some additional guides for working with sensitive data within Australia.

::::::::::::::::::::::::::::::::::::::::::::::::::
## Create the data you wish to see in the world

*Filenames*: Store especially useful metadata as part of the
filename itself, while keeping the filename regular enough for easy
@@ -177,7 +170,7 @@ transformations that we recommend at the beginning of analysis:

## Discussion (2 minutes)

Which of the table layouts is analysis friendly? Discuss. Enter your answers in the collaborative document.
Which of the table layouts is analysis friendly? Discuss.
![](fig/wilson-tidy-data.png){alt="Two tables of data appear side-by-side. The table on the left has columns named site, 1999, and 2000. The table on the right has columns named site, year, and cases."}
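
To make the two layouts concrete, here is a minimal sketch that converts rows in the first layout (one column per year) into the second (one row per site and year). The site names and case counts are invented for illustration, and `wide_to_long` is a hypothetical helper written for this example using only the Python standard library.

```python
# Invented example data in the "one column per year" layout.
wide_rows = [
    {"site": "A", "1999": 10, "2000": 12},
    {"site": "B", "1999": 7, "2000": 9},
]


def wide_to_long(rows, id_column="site", name_column="year", value_column="cases"):
    """Turn every non-id column into its own (id, name, value) row."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue
            long_rows.append(
                {id_column: row[id_column], name_column: int(column), value_column: value}
            )
    return long_rows


long_rows = wide_to_long(wide_rows)
```

Each dictionary in `long_rows` now holds exactly one observation, with `site`, `year`, and `cases` as separate variables.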


@@ -230,7 +223,7 @@ chosen as a set of boundary coordinates.

## How, when and why do you document?

As much as possible, always and to help your future self.
As much as possible, always and to **help your future self**.


::::::::::::::::::::::::::::::::::::::::::::::::::
@@ -259,24 +252,26 @@ Which of the following places would be good places to share your data?

- Personal/lab web-site
- GitHub
- General repo (e.g. Zenodo, Data Dryad, etc.)
- Community specific repo (e.g. ArrayExpress, SRA, EGA, PRIDE, etc.)
- General data repo (e.g. Zenodo, Data Dryad, etc.)
- Community specific repo (e.g. ArrayExpress, NCBI, SRA, EGA, PRIDE, etc.)
- DAP

::::::::::::::: solution

## Solution

- Personal/lab web-site: this is not the best place to store your data long-term. These websites are not hosted long term. You can have a link to the repo, though.
- GitHub: in itself it is not proper for sharing your data as it can be modified. However, a snapshot of a GitHub repository can be stored in Zenodo and be issued a DOI.
- General repo (e.g. Zenodo, Data Dryad, etc.): good option to deposit data that does not fit in a specific repository. Best if the service is non-commercial, has long-term data archival and issues DOIs, such as Zenodo.
- Community specific repo (e.g. ArrayExpress, SRA, EGA, PRIDE, etc.): best option to share your data, if your research community has come up with a sustainable long-term repository.
- Personal/lab web-site: not the best place to store your data long-term. These websites are not hosted long term. You can have a link to the repo, though.
- GitHub: in itself it is not proper for sharing your data as it can be modified. A snapshot of a GitHub repository can be stored in a service like Zenodo or the DAP and be issued a DOI.
- General data repo (e.g. Zenodo, Data Dryad, etc.): good option to deposit data that does not fit in a specific repository. Best if the service is non-commercial, has long-term data archival and issues DOIs, such as Zenodo. But DAP is preferred for CSIRO employees.
- Community specific repo (e.g. ArrayExpress, NCBI, SRA, EGA, PRIDE, etc.): good option, if your research community has come up with a sustainable long-term repository for a certain data type.
- DAP: generally the best option for any data sharing for CSIRO employees.

:::::::::::::::::::::::::

::::::::::::::::::::::::::::::::::::::::::::::::::

Your data is as much a product of your research as the papers you write, and just as likely to be useful to others (if not more so).
Sites such as [Dryad](https://datadryad.org) and [Zenodo](https://zenodo.org) allow others to find your work, use it, and cite it; we discuss licensing in the episode on collaboration [04-collaboration].
Services such as the [DAP](https://data.csiro.au/) can allow others to find your work, use it, and cite it; we discuss licensing in the episode on collaboration [04-collaboration].
Follow your research community's standards for how to provide metadata.
Note that there are two types of metadata: metadata about the dataset as a whole and metadata about the content within the dataset.
If the audience is humans, write the metadata (the README file) for humans.
@@ -288,7 +283,8 @@ If the audience includes automatic metadata harvesters, fill out the formal meta

- A digital object identifier is a persistent identifier or handle used to identify objects uniquely.
- Data with a persistent DOI can be found even when your lab website dies.
- DOI-issuing repositories include: Zenodo, Figshare, Dryad.
- DOI-issuing repositories include: Zenodo, Figshare, Dryad and the DAP.
- e.g. [https://doi.org/10.25919/t1ad-8k76](https://doi.org/10.25919/t1ad-8k76)
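
A DOI on its own (such as `10.25919/t1ad-8k76`) becomes a resolvable link when prefixed with `https://doi.org/`, as in the example above. The small helper below is a sketch of that conversion; `doi_to_url` is a hypothetical name chosen for this example, and it only tidies up a few common ways DOIs are written rather than validating them.

```python
def doi_to_url(doi):
    """Return the https://doi.org/ link for a DOI given in a few common spellings."""
    doi = doi.strip()
    # Strip any resolver address or "doi:" label so that only the bare DOI remains.
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
    return "https://doi.org/" + doi
```

For instance, `doi_to_url("doi:10.25919/t1ad-8k76")` gives the same link as the bullet above.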

::::::::::::::::::::::::::::::::::::::::::::::::::

@@ -300,6 +296,7 @@ If the audience includes automatic metadata harvesters, fill out the formal meta
- Zenodo ([http://zenodo.org](https://zenodo.org)): A repository service that enables researchers, scientists, projects, and institutions to share and showcase multidisciplinary research results (data and publications)
- Dryad ([http://datadryad.org](https://datadryad.org)): A repository that aims to make data archiving as simple and as rewarding as possible through a suite of services not necessarily provided by publishers or institutional websites.
- Dataverse ([http://thedata.org](https://thedata.org)): A repository for research data that takes care of long-term preservation and good archival practices, while researchers can share, keep control of, and get recognition for their data.
- The DAP ([https://data.csiro.au/](https://data.csiro.au/)): CSIRO's own data repository.

::::::::::::::::::::::::::::::::::::::::::::::::::

@@ -318,7 +315,7 @@ Many funders provide basic templates for writing a DMP, along with guidelines on

## Discussion (2 minutes)

Aside from being a requirement, there are many benefits of writing a DMP to researchers. What sort of benefits do you think there are? Enter your answers in the collaborative document.
Aside from being a requirement, there are many benefits of writing a DMP to researchers. What sort of benefits do you think there are?

::::::::::::::: solution

@@ -339,6 +336,8 @@ Often research institutions provide support for DMPs, e.g. through library servi

More resources on data management plans are available at [DMP online](https://dmponline.dcc.ac.uk).

The CSIRO's [Research Data Planner](https://rdp.csiro.au/) can help you prepare a data management plan. It will be covered in detail next week.

:::::::::::::::::::::::::::::::::::::: discussion

## What's your next step in data management?
15 changes: 6 additions & 9 deletions episodes/03-software.md
@@ -53,7 +53,7 @@ There are extended discussions about research software at the [Software Sustaina

**Discussion**

What can go wrong with writing research code?
What can go wrong when trying to reuse research code?

::::::::::::::: solution

@@ -239,23 +239,20 @@ count_fruit_on_island = function(fruit type, island)
return total fruit
```

Write the commands to call this function to count how many coconuts there are on Sam's island, how many cherries
there are on Sam's island, and how many cherries there are on Charlie's island.
Write a pseudocode command to call this function to count how many coconuts there are on Sam's island.

Write a pseudocode for loop like the one above that uses this function to count all the cherries on every island.

::::::::::::::: solution

## Solution

To count Sam's island's coconuts:
```source
sams coconuts = count_fruit_on_island(coconuts, Sam's island)
sams cherries = count_fruit_on_island(cherries, Sam's island)
charlies cherries = count_fruit_on_island(cherries, Charlie's island)
```
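
For anyone who wants to run the idea rather than read it as pseudocode, the following Python sketch covers both the single call above and the loop over every island that the exercise asks for. The body of the function is not shown in this part of the lesson, so the nested dictionary `islands`, its counts, and the lookup inside `count_fruit_on_island` are assumptions made purely for illustration.

```python
# Invented example data: island name -> fruit type -> number of fruit.
islands = {
    "Sam's island": {"coconuts": 12, "cherries": 30},
    "Charlie's island": {"coconuts": 4, "cherries": 18},
}


def count_fruit_on_island(fruit_type, island):
    """Return how many of one fruit type grow on one island (0 if none recorded)."""
    return islands[island].get(fruit_type, 0)


# A single call, matching the first part of the exercise.
sams_coconuts = count_fruit_on_island("coconuts", "Sam's island")

# A loop that reuses the function for every island.
total_cherries = 0
for island in islands:
    total_cherries += count_fruit_on_island("cherries", island)
```

The function is written once and reused inside the loop, which is the point the pseudocode is making.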

To count all the cherries on every island:

```source
total cherries = 0
for every island
@@ -328,13 +325,13 @@ What are the most meaningful names for `functionName` and `variableName`? Choose

1. processFunction - incorrect, too vague
2. computeCubesOfThird - incorrect, doesn't imply every third in sequence
3. cubeEveryThirdNumberInASequence - incorrect, too long
3. cubeEveryThirdNumberInASequence - maybe, but too long
4. **cubeEachThird - correct, short and includes information on the data and calculation performed**
5. 3rdCubed - incorrect, bad practice to put a number at the beginning of a function name (and not allowed by some programming languages)
5. 3rdCubed - incorrect, bad practice to put a number at the beginning of a function name (not allowed by some programming languages)

`variableName`

1. arrayOfNumbersToBeCubed - incorrect, too long
1. arrayOfNumbersToBeCubed - maybe, but too long
2. input - incorrect, too vague
3. **numericSequence - correct, short and includes information about the type of input**
4. S - incorrect, too vague
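
As a sketch of how the chosen names read in practice, here they are in Python, written in the snake_case style that Python code conventionally uses. The function body is an assumption: it takes "every third" to mean the third, sixth, ninth (and so on) elements, which may differ from the original function in the exercise.

```python
def cube_each_third(numeric_sequence):
    """Return the cube of every third number (3rd, 6th, 9th, ...) in the sequence."""
    # Index 2 is the third element; a step of 3 then picks every third one after it.
    return [number ** 3 for number in numeric_sequence[2::3]]
```

With these names, a call such as `cube_each_third([1, 2, 3, 4, 5, 6])` tells the reader what goes in and what happens to it without opening the function.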
88 changes: 73 additions & 15 deletions episodes/04-collaboration.md
@@ -152,30 +152,66 @@ This Software Project README:

::::::::::::::::::::::::::::::::::::::::::::::::::

## Create a shared "to-do" list

This can be a plain text
file called something like `notes.txt` or `todo.txt`, or you can use
sites such as GitHub or Bitbucket to create a new *issue* for each
to-do item. (You can even add labels such as "low hanging fruit" to
point newcomers at issues that are good starting points.) Whatever
you choose, describe the items clearly so that they make sense to
newcomers.

## Decide on communication strategies

Make explicit
decisions about (and publicize where appropriate) how members of the
Make explicit decisions about (and publicize where appropriate) how members of the
project will communicate with each other and with external users /
collaborators. This includes the location and technology for email
lists, chat channels, voice / video conferencing, documentation, and
meeting notes, as well as which of these channels will be public or
private.

Supported tools within CSIRO include [Confluence](https://confluence.csiro.au/),
for wiki-style shared documentation, and Microsoft Teams, for online discussions,
video conferencing, file sharing and more.

Not supported, but worth knowing about, is [Miro](https://miro.com/), a shared online
whiteboard where multiple people can build up notes, diagrams,
graphs, etc., at the same time, with quite powerful tools.

## Collaborations with sensitive data

If you determine that your project will include work with sensitive data, it is important to agree with collaborators on how and where the data will be stored, as well as what the mechanisms for sharing the data will be and who is ultimately responsible for ensuring these are followed.

## Create a shared "to-do" list

Organising a structured to-do list of remaining tasks, along with an overall project work
plan, can be really useful whether you are collaborating with others or just with your future self.
There are many options and tools available for doing this.

- If your project is centered around a Git-tracked repository, it could be as simple
as updating a text file like `notes.txt` or `todo.txt`.

- Or make use of the ability to track "issues" on
[BitBucket](https://support.atlassian.com/bitbucket-cloud/docs/understand-bitbucket-issues/)
or [GitHub](https://docs.github.com/en/issues/tracking-your-work-with-issues/about-issues).
The "issues" feature on online Git repositories allows you or others to describe work that needs
to be done (often used on public repositories for reporting bugs in software),
create and follow discussions about the issues/tasks, and link the closing of the issues
to commits/pull-requests.
E.g.: [Issues for original version of this lesson](https://github.com/carpentries-lab/good-enough-practices/issues)

- Microsoft Teams 'Tasks'. Teams now lets you
[add a 'Tasks' app/tab to a team space](https://support.microsoft.com/en-au/office/use-the-tasks-app-in-teams-e32639f3-2e07-4b62-9a8c-fd706c12c070),
which then lets you create a to-do list, assign tasks to people, and track and view the
tasks in various ways.
![](fig/ms-tasks-list-view.png){alt="An example of Teams Tasks list view"}

- [Jira](https://jira.csiro.au/) is another tool supported and deployed in CSIRO. Developed by
Australian software company [Atlassian](https://www.atlassian.com/software/jira), it allows
tracking of to-do tasks/issues and sub-tasks, lets you assign tasks to people, and lets
you track and view tasks in the context of workflows, timelines, and "board" visualisations,
such as the "Kanban board". Jira can directly integrate with both BitBucket
and Confluence, with Jira tasks able to be linked, referenced and tracked in each.
Jira can be a very powerful tool if fully embraced, but can also be a bit clunky when starting out.

![](fig/jira-kanban-example.png){alt="An example of a Jira Kanban board"}
A Kanban board example, from [Jira's website](https://www.atlassian.com/software/jira/templates/scrum).

![](fig/jira-backlog-example.png){alt="An example of a Jira Backlog"}
A "backlog" view example, from [Jira's website](https://www.atlassian.com/software/jira/templates/scrum).


## Make the license explicit

::::::::::::::::::::::::::::::::::::::::: callout
@@ -197,8 +233,10 @@ explicit license does not mean there isn't one; rather, it implies
the author is keeping all rights and others are not allowed to
re-use or modify the material.
A project that consists of data and text may benefit from a different license to a project consisting primarily of code.

**IM&T can help with advising on suitable licenses.**

We recommend Creative Commons licenses for data and text, either
The original authors of this lesson recommend Creative Commons licenses for data and text, either
[CC-0](https://creativecommons.org/share-your-work/public-domain/cc0/) (the "No Rights Reserved"
license) or [CC-BY](https://wellcome.org/grant-funding/guidance/creative-commons-attribution-licence-cc) (the "Attribution"
license, which permits sharing and reuse but requires people to give
@@ -209,7 +247,7 @@ A useful resource to compare different licenses is available at [tldrlegal](http

> **What Not To Do**
>
> We recommend *against* the "no commercial use" variations of the
> We (the original authors) recommend *against* the "no commercial use" variations of the
> Creative Commons licenses because they may impede some forms of
> re-use. For example, if a researcher in a developing country is
> being paid by her government to compile a public health report,
@@ -221,7 +259,7 @@ A useful resource to compare different licenses is available at [tldrlegal](http
## Make the project citable

A `CITATION` file describes how to cite this
A `CITATION` file describes how to cite your
project as a whole, and where to find (and how to cite) any data
sets, code, figures, and other artifacts that have their own DOIs.
The example below shows the `CITATION` file for the
@@ -235,8 +273,28 @@ Please cite this work as:
Morris, B.D. and E.P. White. 2013. "The EcoData Retriever:
improving access to existing ecological data." PLOS ONE 8:e65848.
http://doi.org/doi:10.1371/journal.pone.0065848
```

More recently, a standard for citation files has been developed: the Citation File
Format (CFF). Usually saved in a `CITATION.cff` file, this format was designed
to hold the expected citation information as a standardised set of fields
that is both human- and machine-readable. E.g.:
```
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: Druskat
given-names: Stephan
orcid: https://orcid.org/1234-5678-9101-1121
title: "My Research Software"
version: 2.0.4
doi: 10.5281/zenodo.1234
date-released: 2021-08-11
```

More information on CFF is available here: [citation-file-format.github.io](https://citation-file-format.github.io/)
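
To illustrate the "machine readable" side, the sketch below turns CFF-style metadata into a plain-text citation. It is only an illustration under stated assumptions: the dictionary mirrors the example fields above as they might look after loading the file with a YAML parser (not shown, since Python's standard library does not include one), and `format_citation` and its output style are invented for this example rather than taken from any CFF tool.

```python
# The example fields from above, as a YAML parser might return them.
citation = {
    "authors": [{"family-names": "Druskat", "given-names": "Stephan"}],
    "title": "My Research Software",
    "version": "2.0.4",
    "doi": "10.5281/zenodo.1234",
    "date-released": "2021-08-11",
}


def format_citation(cff):
    """Build a one-line citation string from CFF-style fields (illustrative style only)."""
    names = "; ".join(
        f"{author['family-names']}, {author['given-names']}" for author in cff["authors"]
    )
    year = cff["date-released"][:4]  # the date is written as YYYY-MM-DD
    return (
        f"{names} ({year}). {cff['title']} "
        f"(version {cff['version']}). https://doi.org/{cff['doi']}"
    )


citation_text = format_citation(citation)
```

Because the fields are standardised, the same small function would work for any project that fills them in the same way.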


## Recommended resources

- [The Turing Way Guide for Collaboration](https://the-turing-way.netlify.app/collaboration/collaboration.html)
Binary file added episodes/fig/jira-backlog-example.png
Binary file added episodes/fig/jira-kanban-example.png
Binary file added episodes/fig/ms-tasks-list-view.png
