forked from emlab-ucsb/SOP
02-files.Rmd
# File Structure
This section details the <UNK>'s organizational structure for files and data stored on OneDrive and code stored on GitHub.
## Folder Naming
There are some general style conventions that should be used when naming folders. For any new folder, be descriptive but concise, avoid spaces, avoid uppercase (or camel-case), and avoid special characters other than `-`. Words should be separated with `-`, so an example folder name could be `blue-paradox-paper`.
This folder naming convention should be used for any folder added to the <UNK> Shared Drive or GitHub.
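The convention above can be checked mechanically. The following is a minimal sketch, not an official tool: the function name `is_valid_folder_name` is hypothetical, and allowing digits in names is an assumption, since the convention does not mention them.

```bash
# Succeeds (exit 0) when a folder name uses only lowercase letters and digits,
# with words separated by single hyphens. Allowing digits is an assumption.
is_valid_folder_name() {
  local pattern='^[a-z0-9]+(-[a-z0-9]+)*$'
  [[ "$1" =~ $pattern ]]
}

# Try the check on a few candidate names.
for name in "blue-paradox-paper" "Blue Paradox" "blueParadox" "blue_paradox"; do
  if is_valid_folder_name "$name"; then
    echo "ok:     $name"
  else
    echo "rename: $name"
  fi
done
```

Only the first candidate passes; the others contain a space, uppercase letters, or an underscore.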
## Shared Drive Files
### General Structure
![](images/shared_drive_overview.png)
<!-- ``` -->
<!-- OneDrive -->
<!-- |__ My Drive -->
<!-- | |__ ... whatever files you have on your OneDrive ... -->
<!-- |__ Shared drives -->
<!-- |__ <UNK> -->
<!-- |__ Central <UNK> Resources -->
<!-- |__ Communications -->
<!-- |__ Data -->
<!-- |__ Projects -->
<!-- |__ Strategy -->
<!-- ``` -->
The <UNK> Shared Drive is organized into five main folders:
* `central-<UNK>-resources`: includes meeting and event information, onboarding materials, information about travel reimbursements, and the team roster
* `communications`: includes the blog schedule, Adobe design projects, PowerPoint templates, photo repository, <UNK> logos, and publication and media tracking
* `data`: includes the <UNK> data directory and all datasets we work with (see section 2.2.3 for more on this)
* `projects`: includes information on past and current projects, and project management guidelines
* `strategy`: includes information about <UNK>'s strategic plan
A full table of contents can be seen [here](https://docs.google.com/document/d/1a26a6N4akF2dSXWfp0fWxGegd0wvrPVUeaA09en2EPE/edit).
### Project Folder Structure
![](images/shared_drive_project.png)
<!-- ``` -->
<!-- OneDrive -->
<!-- |__ Shared drives -->
<!-- |__ <UNK> -->
<!-- |__ Projects -->
<!-- |__ Archived Projects -->
<!-- |__ Current Projects -->
<!-- | |__ example-project -->
<!-- | | |__ Data -->
<!-- | | |__ Deliverables -->
<!-- | | |__ Grant Reporting -->
<!-- | | |__ Meetings and Events -->
<!-- | | |__ Presentations -->
<!-- | | |__ Project Materials -->
<!-- | |__ ... other projects ... -->
<!-- |__ Project Management Resources -->
<!-- |__ Scoping -->
<!-- ``` -->
Each project folder must contain the following 6 folders:
* `data`: This data folder will contain a `data_overview` spreadsheet and all of the intermediate datasets as well as output datasets associated with the project (see section 2.2.3 for more on this). Be sure to also add a copy of your final datasets to the `<UNK>/data` folder and data directory.
* `deliverables`: final reports, paper manuscripts, other final deliverables not related to data outputs
* `grant-reporting`: grant reports for funders
* `meetings-and-events`: meeting notes, agendas, documentation for workshop/event planning
* `presentations`: any presentations created for the project
* `project-materials`: everything else that does not fit into one of these folders (e.g. drafts of methods, literature review, etc.)
From here, each project can add sub-folders as they see fit within these 6 folders.
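If you prefer to set up a new project folder from a terminal, the six required folders can be created in one step. This is only a sketch: `example-project` is a placeholder name, and the command assumes it is run from inside the synced folder that holds your current projects.

```bash
# Placeholder project name; replace with your own (lowercase, hyphen-separated).
project="example-project"

# Create the six required top-level folders in a single call.
mkdir -p "$project"/{data,deliverables,grant-reporting,meetings-and-events,presentations,project-materials}

# List the result to confirm the structure.
ls "$project"
```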
### Data Storage
As stated above, data can be stored in two locations: `OneDrive/Shared drives/<UNK>/projects/example-project/data` and `OneDrive/Shared drives/<UNK>/data`. This may seem redundant, so this section explains the difference between them. In short, `example-project/data` may contain raw, cleaned, intermediate, and output files for a given project, and will be used as the "workspace" while the project develops. On the other hand, `<UNK>/data` contains only (raw) input and output data from a finalized project. More detail is provided in the subsequent sections.
#### `example-project/data`
**All** data used in a project should live in this project-specific repository. To help keep track of project data, we highly recommend creating a `data_overview` spreadsheet (see section 3.3.2 for more information). This overview spreadsheet will provide a centralized summary of the data inputs and outputs for a project, and also allow teams to keep track of the status of adding the data to the `<UNK>/data` folder and creating the necessary metadata documentation. This repository will typically contain subfolders for `raw`, `processed` (or `clean`), and `output`, although each team might make slight modifications to this structure to suit their needs.
To illustrate how each of these subfolders might be used, consider the following. A team may receive data from partners, extract data from external sources, compile survey responses, create a new dataset from a literature review, or use results from previous projects as input. These data are termed “raw data” and should never be directly modified - **all of the errors, mistakes, and gremlins should be kept in the original versions**. Instead, they should be processed / cleaned, and then exported as “clean data” that is actually used in analyses. The script used to do the processing / cleaning then acts as a reproducible record of everything that was done to the raw data.
Suppose that a team working in Montserrat is tasked to perform a stock assessment on lobster populations and receives a database of lobster landings from the government. These data are stored as an Excel spreadsheet, and will surely contain many mistakes that need to be fixed prior to running any analyses. The team will clean the data (preferably using a reproducible script), and then export a new version of the data in which the mistakes have been fixed. The team will then perform the stock assessment, and produce results before reporting back. Therefore, the project-level data folder could be subdivided into `raw`, `clean`, and `output` folders. The first one will contain the Excel file received from the government. The second folder will contain the cleaned data (perhaps exported as a csv), which can then be used as input for analyses within this project. The `output` folder will then contain the stock assessment results that might be relevant to other projects.
![](images/shared_drive_project_data.png)
<!-- ``` -->
<!-- Projects -->
<!-- |__ Current Projects -->
<!-- |__ montserrat-project -->
<!-- | |__ Data -->
<!-- | |__ raw -->
<!-- | | |__ lobster_landings_nov_2012.xlsx -->
<!-- | |__ clean -->
<!-- | | |__ lobster_landings_nov_2012.csv -->
<!-- | |__ output -->
<!-- | |__ lobster_stock_assessment.csv -->
<!-- |__ ... other projects ... -->
<!-- ``` -->
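One way to reinforce the "never modify raw data" rule is to make raw files read-only once they are in place. The sketch below mirrors the Montserrat example. The paths are illustrative, the empty file stands in for the real spreadsheet, and file permissions may not be preserved by every cloud sync client, so treat this as a local safeguard rather than a guarantee.

```bash
# Illustrative path mirroring the example above.
data_dir="montserrat-project/data"
mkdir -p "$data_dir"/{raw,clean,output}

# Empty stand-in for the spreadsheet received from the partner.
raw_file="$data_dir/raw/lobster_landings_nov_2012.xlsx"
touch "$raw_file"

# Remove write permission for everyone so the raw file is not edited by accident.
chmod a-w "$raw_file"

# Show the resulting permissions.
ls -l "$raw_file"
```

Cleaning scripts can still read the file and write their results to `clean`, which leaves the original untouched.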
As stated above, since the `output` folder could contain information relevant to other projects, this data should be made available to other <UNK> projects once the project is complete. To do this, any `output` data (and `raw` data if it is not already there) should be copied to the `<UNK>/data` folder, as described below.
#### `<UNK>/data`
As a general rule, this folder contains all data *used* and *produced* by <UNK> projects. The idea is to make it easier for people to find data that has been used in previous projects, as well as to use previous results as inputs for new projects.
To illustrate types of data that should be in the `<UNK>/data` folder, consider the following. The [RAM Legacy stock assessment database](https://www.ramlegacy.org/) is key to many projects, and was used as input in the [Costello *et al.* 2016 "upsides" paper](https://www.pnas.org/content/113/18/5125). The "upsides database" is an output from the Costello paper, which has then been used as input for other projects. Therefore, the `<UNK>/data` folder contains separate folders for both the RAM and upsides datasets.
This large central data repository has the potential to become messy. Therefore, it is important to follow some key guidelines to store the data. **All datasets in this folder should be contained within their own folders that include at minimum the data and metadata files.** For example, a file structure for the two datasets mentioned above might be:
![](images/shared_drive_data.png)
<!-- ``` -->
<!-- <UNK> -->
<!-- |__ ... [other folders] -->
<!-- |__ Data -->
<!-- |__ upsides -->
<!-- | |__ _readme_upsides.txt [the metadata] -->
<!-- | |__ upsides.csv [the data] -->
<!-- |__ ram -->
<!-- | |__ _readme_RAM.txt [the metadata] -->
<!-- | |__ RAM v4.10 [the data] -->
<!-- | |__ RAM v4.15 [the data] -->
<!-- | |__ RAM v4.25 [the data] -->
<!-- | |__ RAM v4.40 [the data] -->
<!-- |__ ... other data sets ... -->
<!-- ``` -->
In the above example, the folder containing the upsides database is relatively straightforward with the metadata file and a single csv file. However, the folder containing the RAM database is more complicated as this is a dataset that is re-released every so often as a new version. Specific guidelines for organizing different types of data within the `<UNK>/data` folder are discussed in detail in section 3.
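The rule that every dataset folder needs a metadata file can be spot-checked with a short script. This is a sketch under assumptions: the function name `missing_readme` is hypothetical, metadata files are assumed to start with `_readme_` as in the example structure above, and the demo runs against a throwaway folder rather than the real shared drive.

```bash
# Print every dataset folder directly under $1 that has no _readme_* file.
missing_readme() {
  local root="$1" dir
  for dir in "$root"/*/; do
    [ -d "$dir" ] || continue   # skip when there are no subfolders
    if [ -z "$(find "$dir" -maxdepth 1 -type f -name '_readme_*')" ]; then
      echo "${dir%/}"
    fi
  done
}

# Demo on a temporary folder with empty placeholder files.
demo="$(mktemp -d)"
mkdir -p "$demo"/{upsides,ram,new-dataset}
touch "$demo/upsides/_readme_upsides.txt" "$demo/upsides/upsides.csv"
touch "$demo/ram/_readme_RAM.txt"
touch "$demo/new-dataset/values.csv"   # no metadata file yet

missing_readme "$demo"
```

In the demo, only the `new-dataset` folder is reported, because it is the only one without a `_readme_` file.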
## GitHub Structure
### Project Repository Structure
The structure of each <UNK> repository on GitHub will likely vary depending on the needs of the project, but the following structure is suggested as a starting point.
![](images/github_project.png)
<!-- ``` -->
<!-- Github -->
<!-- |__ <UNK>-ucsb [organization] -->
<!-- |__ example-project [repository] -->
<!-- |__ documents -->
<!-- |__ results -->
<!-- |__ scripts -->
<!-- |__ functions -->
<!-- |__ ... other folders as needed... -->
<!-- ``` -->
A `documents` (or `docs`) folder may be useful for storing code files that are used to generate text-based documents or presentations, such as markdown files.
A `results` folder may be useful for storing plots or other types of results generated by the project. Some discretion needs to be used here, as some results may actually be considered to be "processed" or "output" data. However, results in the form of figures or workspace image files might live here.
A `scripts` folder may be useful for storing the code files that do everything from processing the raw data to running the analysis and generating outputs.
A `functions` folder may be useful for storing code files that define functions used by many scripts.
Different types of projects may require more or fewer folders and these are only meant to act as suggestions. Regardless, the structure of the repository should be sufficiently organized such that it can be easily navigated and understood by others by the time the project is completed.
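The suggested layout can be created locally before the first commit. Git does not track empty directories, so this sketch drops a placeholder file into each folder. The `.gitkeep` name is a common convention rather than a Git feature, and `example-project` is a placeholder for your local clone.

```bash
# Placeholder for the local clone of your repository.
repo="example-project"

for dir in documents results scripts functions; do
  mkdir -p "$repo/$dir"
  # Empty placeholder so Git has a file to track in each new folder.
  touch "$repo/$dir/.gitkeep"
done
```

Once a folder contains real files, its placeholder can be deleted.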
### A repo inside a repo (a good thing to have)
Sometimes, a project may have more than one paper or analysis section. In some corner cases, we might want to have multiple "paper folders" within a "project folder", which implies having a repo inside a repo. If that makes sense for you, your project, and your team, then Git submodules (`git submodule`) are your solution. If you want to read more on when and how to use submodules, visit the [documentation page here](https://git-scm.com/book/en/v2/Git-Tools-Submodules).
Including submodules in your workflow is simple. Here's an example. You are working on a big project called "Blue Future". The project has six PIs, 13 Research Specialists, two PostDocs, and three PhD Students. After a long kick-off meeting, the team realizes that the project will produce two papers and a ShinyApp. You are all determined to keep everything in the same folder, but correctly categorized and organized. As such, you go to GitHub and create the following four repositories:
- Blue Future
- Paper 1
- Paper 2
- ShinyApp
You'll clone the Blue Future repo onto your computer, using the usual:
```
git clone https://github.com/gaelic-algorithmic-research-group/BlueFuture.git
```
Now, instead of cloning the repos for each paper and the app into their own folders, navigate into your local BlueFuture folder and add each one as a submodule:
```
git submodule add https://github.com/gaelic-algorithmic-research-group/Paper1.git
```
This will clone the Paper 1 repo, but not without first telling the BlueFuture repo about it (just so that you don't end up tracking things twice). You can repeat the operation for `Paper2` and `ShinyApp`. That's it!
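To see what the parent repo records, and how a teammate would later get everything in one go, here is a self-contained offline sketch. It uses throwaway local repositories in a temporary folder as stand-ins for the GitHub URLs above. The helper `g` is hypothetical: it sets a dummy identity for the demo commits and enables `protocol.file.allow`, which recent Git versions need only because this demo uses local paths as submodule URLs.

```bash
# Work in a throwaway folder so no real repository is touched.
sandbox="$(mktemp -d)"
cd "$sandbox"

# Hypothetical helper: git with a dummy identity and local-path submodules allowed.
g() {
  git -c user.name="Example" -c user.email="example@example.com" \
      -c protocol.file.allow=always "$@"
}

# Local stand-ins for two of the repositories created on GitHub.
for repo in BlueFuture Paper1; do
  g init --quiet "$repo"
  g -C "$repo" commit --quiet --allow-empty -m "Initial commit"
done

# Inside the parent repo, register Paper1 as a submodule and commit the link.
cd "$sandbox/BlueFuture"
g submodule --quiet add "$sandbox/Paper1" Paper1
g commit --quiet -m "Add Paper1 as a submodule"

# .gitmodules stores the path-to-URL mapping; status shows the pinned commit.
cat .gitmodules
g submodule status

# A teammate can fetch the parent and its submodules in a single step.
cd "$sandbox"
g clone --quiet --recurse-submodules BlueFuture teammate-copy
```

If a repository was cloned without `--recurse-submodules`, running `git submodule update --init --recursive` inside it fetches the submodules afterwards.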