adding-data
+Adding data
@@ -64,112 +65,157 @@adding-data
-Adding a new dataset to the package
+Guide to contributing new datasets
+
+Package folder structure
+
+Below is a simplified explanation of the R Packages data chapter. For a
+fuller understanding, read that chapter.
-- guideline for contributing datasets similar to tidytuesday
-
-
-
-
-repo folder structure
-
-Read this r packages
-chapter for details.
-.
-├── appliedepidata.Rproj
-├── data
-│ └── newdata.rda
-├── data-raw
-│ └── newdata.R
-├── inst
-│ └── extdata
-│ └── tableoftables.xlsx
-│ └── newdata.xlsx
-├── R
-│ └── newdata_doc.R
-└── man
- └── newdata.Rd
-
-- folder structure
+
- The following package folders are important:
-
-data R datasets go in data folder
+data: R datasets go in the data
+folder.
-
-inst/extdataNon-R datasets go in inst>extdata
-
-- option for adding messy datasets. However this would then not have
-an internalised .rda version as described below.
-
-
+inst/extdata: Non-R datasets go in the
+inst/extdata folder.
-
-internal data additional complication is that when
-you build a package, you can make the Rda datasets (from data folder)
-“internal” (more efficient for file storage, as then become part of
-binary), and these are then access by doing package::dataset. They can
-also be imported directly from github using link to the file in data
-folder e.g. ‘rio(
)’.
+internal data: When you build a package, the Rda datasets (from the data
+folder) can become “internal” (more efficient for file storage). These are
+accessed by calling package::dataset (e.g. appliedepidata::AJS_AmTiman).
+They can also be imported directly from GitHub using a link to the file in
+the data folder, e.g. ‘rio()’, or with the appliedepidata::get_data or
+appliedepidata::save_data functions.
-
-data-raw usually contains R scripts which are used
-for creating the exported or internal data (e.g. if have edited dataset,
-or where {usethis} internalises dataset)
+data-raw: Contains R scripts used for creating the
+exported or internal data (e.g. if you have edited a dataset or used
+{usethis} to internalise the dataset)
-
-sysdata probably the tableoftables should
-just stay in extdata but alternatively could go in sysdata
-which is not exported (i.e. just for package usage)
+sysdata: Not relevant for the current package setup. In
+some setups tableoftables would go in sysdata
+(i.e. just for package usage). However, for our current setup, leave it in
+extdata.
-
-if a non-R (not .rda) file
-
-
-- name file what want but stick to guidelines described in metadata
-below
-- versioning of datasets (and edits thereof, e.g. same data for diff
-bits of course)
-- define overarching source file and resulting child files (as per
-metadata below)
-- place file in ‘inst/extdata’ folder
-- if adding a shapefile then zip it
-- in console run
-’usethis::use_data_raw(
)
-
-- read in the file with ‘rio’ and ‘system.file’
-- make edits as needed
-- save with usethis::usedata()
-- add documentation for each data set added (ideally group these
-within the same file if part of the same group of datasets)
-- ideally: data dictionary used to fill in the man files for
-each (link to function)
-- probably do the same way alison horst
-does
-- licensing: while the overall package repo will be GPL3, it is
-possible that individual datasets will come under a different license
-(so there needs to be a license section in documentation for each
-dataset)
-
+
-if adding an .rda file:
+Adding a file
-
-- link to the appropriate part from non-R onwards where relevant
-
+
+This describes the process for adding a file to the repo. Note that
+the processes for adding a non-R file (any file that is not .rda) and an
+R file (any file already in .rda format) are slightly different. If you
+are adding a dataset from an existing R package, you can skip to step 3
+below.
+
+- Name your file appropriately
+
+
+- You can name it whatever you want, but stick to basic naming
+conventions.
+
+- Ensure that there is not already a file with the same name in
+tableoftables.xlsx.
+- Avoid generic names like linelist_cleaned.xlsx or survey_data.xlsx.
+- Use consistent and descriptive names without spaces (e.g.
+AJS_AmTiman, sitrep_mortality_survey).
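The naming guidance above could be checked before a file is added. The following is a minimal sketch in base R, assuming a hypothetical helper `check_dataset_name()` (it is not part of appliedepidata) that receives the proposed name without its extension, plus the names already registered in the metadata sheet as a character vector:

```r
# Hypothetical helper (not part of appliedepidata): returns a character
# vector of problems found with a proposed dataset name (empty if none).
check_dataset_name <- function(name, existing = character()) {
  # Names considered too generic; illustrative list taken from the examples above
  generic <- c("linelist_cleaned", "survey_data")

  problems <- character()
  if (grepl("[[:space:]]", name)) {
    problems <- c(problems, "contains spaces")
  }
  if (tolower(name) %in% generic) {
    problems <- c(problems, "too generic")
  }
  if (name %in% existing) {
    problems <- c(problems, "already used")
  }
  problems
}

# A descriptive, unused name passes with no problems reported
check_dataset_name("AJS_AmTiman")

# A name with a space is flagged
check_dataset_name("survey data", existing = "AJS_AmTiman")
```

The list of generic names and the wording of the messages are assumptions for illustration; a real check would read the existing names from the metadata sheet.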
+
+
+- Place your file in the correct folder
+
+
+- A non-R file (e.g. xlsx, shp, zip) goes in the inst/extdata folder.
+If adding a shapefile, zip it first.
+- An R file (e.g. rda, rds) goes in the data folder.
+
+
+- Reproducibly edit the dataset and internalise it (see
+data-raw/AJS_AmTiman.R for an example)
+
+
+- In your console run
+usethis::use_data_raw("<name of your file without extension>")
+
+- This creates an R script in the data-raw folder.
+- Read in the file by defining the path with system.file.
+If you are editing a file already in the package (e.g. shortening the
+Ebola linelist for a course), make sure you read in the original dataset
+here. Document this properly with {roxygen} and in the metadata as
+described below.
+- Make any edits necessary to your dataset in a reproducible way.
+- Save and internalise the dataset with usethis::use_data().
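The steps above can be sketched as a data-raw script. This is an illustrative outline rather than the contents of the package's own data-raw/AJS_AmTiman.R: the edit shown (dropping fully empty rows) is an assumption chosen only to demonstrate a reproducible change, and the guard around the file path is there so the sketch does nothing harmful when the file cannot be found.

```r
# data-raw/AJS_AmTiman.R (illustrative sketch)

# Locate the raw file shipped in inst/extdata; system.file() returns ""
# when the package or the file cannot be found
raw_path <- system.file("extdata", "AJS_AmTiman.xlsx", package = "appliedepidata")

if (nzchar(raw_path)) {
  # Read in the original dataset
  AJS_AmTiman <- rio::import(raw_path)

  # Make reproducible edits here. Illustrative only:
  # drop rows in which every value is missing
  AJS_AmTiman <- AJS_AmTiman[rowSums(!is.na(AJS_AmTiman)) > 0, , drop = FALSE]

  # Save the result as data/AJS_AmTiman.rda (run from the package project)
  usethis::use_data(AJS_AmTiman, overwrite = TRUE)
} else {
  message("Raw file not found; install the package or load it with devtools::load_all() first.")
}
```

Keeping every edit in the script, rather than editing the spreadsheet by hand, means the .rda file can be regenerated from the raw file at any time.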
+
+
+- Add documentation for each dataset added
+
+
+- This is done in an R script in the R folder.
+- Name the script something that will allow reviewers to find it
+(e.g. AJS_chad) and suffix it with _doc so that
+it can be differentiated from functions.
+- Place all the documentation for datasets in that group within the
+same script.
+- Make sure to clearly document the source and license for the
+dataset.
+- Add an explanation for each variable. If you have a data
+dictionary, you can use appliedepidata::create_desc()
+to help with this. You could also create a data dictionary for use
+with this function; see the data
+dictionary walk-through
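A documentation script following the points above might look like the skeleton below. It is a sketch that assumes standard roxygen2 dataset documentation; the file name R/AJS_chad_doc.R only follows the naming suggestion above, and every angle-bracket placeholder is to be replaced with real information about the dataset. The quoted name on the last line is the dataset name that roxygen2 attaches the documentation to.

```r
# R/AJS_chad_doc.R (illustrative skeleton; placeholders in <...> must be filled in)

#' <Short title of the dataset>
#'
#' <One or two sentences describing what the dataset contains.>
#'
#' @format A data frame with <n> rows and <p> variables:
#' \describe{
#'   \item{<variable_1>}{<explanation of variable 1>}
#'   \item{<variable_2>}{<explanation of variable 2>}
#' }
#' @source <Where the data came from>
#' @section License:
#' <License that applies to this dataset>
"AJS_AmTiman"
```

Several datasets from the same group can be documented in one script by repeating the block, each ending with its own quoted dataset name.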
+
+
+
+- Add the datasets to
_pkgdown.yml
+
+
+
+- Group relevant datasets under the same subtitle (suffix with the
+language)
+- The names here correspond to the name in quotations at the end of
+your description file from point 4 above, as well as the name of the
+file (without file extension).
+
+
+- Add the dataset to the
tablesoftables.xlsx
as described
+below.
+
-Data Entry Guide for Dataset Metadata
+Defining dataset metadata (adding to
+tablesoftables.xlsx
)
Below is a table explaining how to fill in each variable in the
-dataset metadata Excel sheet (1tablesoftables.xlsx). This guide helps
-ensure consistency and completeness when adding new datasets to your
-collection.
+dataset metadata Excel sheet (tablesoftables.xlsx
). This
+guide helps ensure consistency and completeness when adding new datasets
+to your collection.
name: The filename of the dataset as it appears
in the inst/extdata
directory, without the
diff --git a/articles/available-data.html b/articles/available-data.html
index 0df2207..e71d643 100644
--- a/articles/available-data.html
+++ b/articles/available-data.html
@@ -5,12 +5,12 @@
-
available-data • appliedepidata
+Available data • appliedepidata
-
+
Skip to contents
@@ -33,8 +33,9 @@
-
@@ -56,7 +57,7 @@
- available-data
+ Available data
@@ -65,16 +66,12 @@ available-data
-
-
-Available datasets
-
-You can see all available datasets below.
+You can see all available datasets below. This currently only lists
+English language names; however, you can use
+appliedepidata::search_data() to search for datasets in
+other languages available in the package.
-
-
-
+
diff --git a/articles/data-dictionaries.html b/articles/data-dictionaries.html
new file mode 100644
index 0000000..38f3121
--- /dev/null
+++ b/articles/data-dictionaries.html
@@ -0,0 +1,129 @@
+
+
+
+
+
+
+
+Data dictionaries • appliedepidata
+
+
+
+
+
+
+
+ Skip to contents
+
+
+
+
+
+
+
+
+
+
+ Data dictionaries
+
+
+
+ data-dictionaries.Rmd
+
+
+
+
+
+This is a brief walk-through of how to create a data dictionary for
+an existing dataset. If you intend to add this to the package,
+you must follow the guidelines
+for adding datasets. If you want to use the dictionary with
+appliedepidata::create_desc(), add your
+variable descriptions to the note column. Similarly, this
+dictionary could be used for translating your dataset: add columns
+to the Excel sheet with the language suffix (e.g. _fr),
+translate the content, and then use {matchmaker} to
+recode.
+# Define the path to the Excel file in inst/extdata
+import_path <- system.file("extdata", "AJS_AmTiman.xlsx", package = "appliedepidata")
+
+# Define the path to export the Excel file (Dictionary) to in inst/extdata
+export_path <- file.path("inst", "extdata", "AJS_AmTiman_dict.xlsx")
+
+
+# Read in the Excel file using rio
+AJS_AmTiman <- rio::import(import_path)
+
+
+# create variable list
+survey <- datadict::dict_from_data(AJS_AmTiman)
+# add in a notes column that can then be edited manually to describe variables
+survey$note <- NA
+
+# create list of variable values (choices)
+choices <- datadict::coded_options(survey)
+
+# combine into a named list; each element is written to its own sheet
+data_export <- list(
+ survey = survey,
+ choices = choices
+)
+
+# write to excel sheet
+rio::export(data_export, export_path)
+
+
+
+
+
+
+
+
+
+
+
+
+
+
diff --git a/articles/index.html b/articles/index.html
index 9e7ddd2..9edeba8 100644
--- a/articles/index.html
+++ b/articles/index.html
@@ -18,8 +18,9 @@
diff --git a/authors.html b/authors.html
index 32fa3e1..b5eb0d6 100644
--- a/authors.html
+++ b/authors.html
@@ -18,8 +18,9 @@
@@ -66,53 +67,10 @@ Installationpak::pak("appliedepi/appliedepidata")
stuff to consider
+Usage
-
-functions
-
-
-
--
-
-
-
-
-
--
-
-
-- ?necessary - rather than have new users freakout about different ways of accessing the data)
-
-
-
--
-
-
-- if decide to help w/ {datadict} then {epidict} would just become a fake data generator.
-- odk example dictionary from xlsxform
-
-
-
-
-
--
-
-
-- ?pull snapshots bit from {gapminder} re changes to datasets
-
-
-
-
-
+
You can see available datasets in the available data table, as well as in-depth descriptions in the reference section. Within RStudio you can use search_data to browse available datasets.
+Using that information you can either load data into your RStudio environment using get_data or save a copy of the data to your computer using save_data.