Please provide feedback via GitHub issues or a pull request.
GitHub repository: https://github.com/nfdi4plants/ARC-specification
This specification is Copyright 2022 by DataPLANT.
Licensed under the Creative Commons License CC BY, Version 4.0; you may not use this file except in compliance with the License. You may obtain a copy of the License at https://creativecommons.org/about/cclicenses/. This license allows re-users to distribute, remix, adapt, and build upon the material in any medium or format, including for commercial use, so long as credit is given to the creator.
- Annotated Research Context Specification, v2.0
- Table of Contents
- Introduction
- ARC Structure and Content
- Shareable and Publishable ARCs
- Reproducible ARCs
- Mechanisms for ARC Quality Control
- Best Practices
- Appendix: Conversion of ARCs to RO Crates
This document describes a specification for a standardized way of creating a working environment and packaging file-based research data, together with the contextual information necessary for working, collaboration, preservation, reproduction, re-use, archiving, and distribution. This organizational unit is named Annotated Research Context (ARC) and is designed to be both human and machine actionable.
ARCs are digital objects that fulfill all FAIR principles and are therefore referred to as FAIR Digital Objects (FDO).
An ARC is intended to capture research data, analysis, and metadata and their evolution in scenarios ranging from single experimental setups to complex experimental designs in plant biological research. Its design intent is to assist researchers in meeting FAIR requirements while minimizing the workload for doing so. ARCs are self-contained and include study materials, assay and measurement data, workflows, and computation outputs, accompanied by metadata and history, in one package. To this end, ARCs combine existing standards, leveraging the ISA metadata model for administrative and experimental metadata and the Common Workflow Language (CWL) to represent processing specifications.

ARCs are furthermore designed to enable straightforward conversion to other types of research data archives, such as Research Object (RO) Crates, to facilitate interoperation with widely used data repositories (e.g. PRIDE, GEO, ENA). Therefore, ARCs aggregate administrative, experimental, and workflow metadata within a common structure.
This specification is intended as a practical guide for software authors to create tools for generating and consuming research data packages.
The key words MUST, MUST NOT, REQUIRED, SHALL, SHALL NOT, SHOULD, SHOULD NOT, RECOMMENDED, MAY, and OPTIONAL in this document are to be interpreted as described in RFC 2119. This specification is based on the ISA model and the Common Workflow Language (CWL) specification, v1.2.
The ARC specification can be extended in a backwards compatible way and will evolve over time. This is accomplished through a community-driven ARC discussion forum and pull request mechanisms.
All changes that are not backwards compatible with the current ARC specification will be implemented in ARC specification v2.0.
ARCs are based on a strict separation of data and metadata content into study material (studies), measurement and assay outcomes (assays), computation results (runs) and computational workflows (workflows) generating the latter. The scope or granularity of an ARC aligns with the necessities of individual projects or large experimental setups.
Each ARC is a directory containing the following elements:

- *Studies* are collections of material and resources used within the investigation. Study-level metadata is stored in ISA-XLSX format in an `isa.study.xlsx` file, which MUST exist to specify the input material or data resources. Resources MAY include biological materials (e.g. plant samples, analytical standards) created during the current investigation. Resources MAY further include external data (e.g. knowledge files, results files) that need to be included and cannot be referenced due to external limitations. Resources described in a study file can be the input for one or multiple assays. Further details on `isa.study.xlsx` are specified below. Resource (descriptor) files MUST be placed in a `resources` subdirectory. Further explications about data entities defined in the study are stored in ISA-XLSX format in an `isa.datamap.xlsx` file, which SHOULD exist for studies containing data. Further details on `isa.datamap.xlsx` are specified in the ISA-XLSX specification.

- *Assays* correspond to outcomes of experimental assays or analytical measurements (in the interpretation of the ISA model) and are treated as immutable data. Each assay is a collection of files, together with a corresponding metadata file, stored in a subdirectory of the top-level `assays` subdirectory. Assay-level metadata is stored in ISA-XLSX format in an `isa.assay.xlsx` file, which MUST exist for each assay. Further details on `isa.assay.xlsx` are specified below. Assay data files MUST be placed in a `dataset` subdirectory. Further explications about data entities defined in the assay are stored in ISA-XLSX format in an `isa.datamap.xlsx` file, which SHOULD exist for each assay. Further details on `isa.datamap.xlsx` are specified in the ISA-XLSX specification.

- *Workflows* represent data analysis routines (in the sense of CWL tools and workflows) and are a collection of files, together with a corresponding CWL description, stored in a single directory under the top-level `workflows` subdirectory. A per-workflow executable CWL description is stored in `workflow.cwl`, which MUST exist for all ARC workflows. Further details on workflow descriptions are given below.

- *Runs* capture data products (i.e., outputs of computational analyses) derived from assays, other runs, or study materials using workflows (located in the aforementioned `workflows` subdirectory). Each run is a collection of files, stored in the top-level `runs` subdirectory. It MUST be accompanied by a per-run CWL workflow description, stored in `run.cwl`, as further described below.

- *Top-level metadata and workflow description* tie together the elements of an ARC in the contexts of investigation and associated studies (in the ISA definition), captured in the file `isa.investigation.xlsx` in ISA-XLSX format, which MUST be present. Furthermore, top-level reproducibility information SHOULD be provided in the CWL file `arc.cwl`.

All other files contained in an ARC (e.g. a `README.txt`, pre-print PDFs, additional annotation files) are referred to as additional payload and MAY be located anywhere within the ARC structure. However, an ARC MUST be reproducible and publishable even if these files are deleted. Further considerations on additional payload are described below.
Note:

- Subdirectories and other files in the top-level `studies`, `assays`, `workflows`, and `runs` directories are viewed as additional payload unless they are accompanied by the corresponding mandatory description (`isa.study.xlsx`, `isa.assay.xlsx`, `workflow.cwl`, `run.cwl`) specified below. This is intended to allow gradual migration from existing data storage schemes to the ARC schema. For example, data files for an assay may be stored in a subdirectory of `assays/`, but are only identified as an assay of the ARC if metadata is present and complete, including a reference from the top-level metadata.
```
<top-level directory>
|   isa.investigation.xlsx
|   arc.cwl [optional]
|   arc.yml [optional]
\--- studies
    \--- <study_name>
        |   isa.study.xlsx
        |   isa.datamap.xlsx [optional]
        \--- resources
        \--- protocols [optional / add. payload]
\--- assays
    \--- <assay_name>
        |   isa.assay.xlsx
        |   isa.datamap.xlsx [optional]
        \--- dataset
        \--- protocols [optional / add. payload]
\--- workflows
    \--- <workflow_name>
        |   workflow.cwl
        |   docker-compose.yml [optional / add. payload]
\--- runs
    \--- <run_name>
        |   [files;...] (different output files)
        |   run.cwl
        |   run.yml [optional]
```
ARCs are Git repositories, as defined and supported by the Git C implementation (version 2.26 or newer) with the Git-LFS extension (version 2.12.0 or newer), or fully compatible implementations.
ARC terminology implicitly borrows from Git and Git-LFS terminology. For example, an ARC commit is simply a Git commit, and the ARC history is the repository history. Furthermore, an ARC can contain multiple branches, etc.
Tree objects (i.e., directories) and blobs (i.e., files) of all branch heads in the repository MUST adhere to the ARC schema. ARCs allow all typical Git operations (e.g. clone, branch).
All representations suitable for Git-LFS repositories are also valid representations of ARCs. This includes both bare repositories (without a checked-out working copy) and non-bare repositories (i.e. a `.git` directory with one or more attached working copies). In particular, it is possible and intended to maintain ARCs on local user filesystems and via Git repository hosting services. No requirements are made for the state and contents of working copies.
Notes:

- Archival representations (e.g. `.zip` or `.tar.gz`) are valid ARC representations if archives are created to preserve file attributes, i.e. if unarchiving preserves Git interoperability. Furthermore, Git's bundle mechanism can be used to create archives of complete ARCs or individual branches. For archiving purposes, `git bundle create <file> --all` or an equivalent should be used.
- Elements of an ARC are implicitly content-addressable using standard Git mechanisms via SHA-1 hashes.
- Removing the `.git` top-level subdirectory (and thereby all provenance information captured within the Git history) from a working copy invalidates an ARC.
The ISA-XLSX specification is currently part of the ARC specification. Its version therefore follows the version of the ARC specification. It is available at https://github.com/nfdi4plants/ARC-specification/blob/main/ISA-XLSX.md.
The characteristics of all material and resources used within the investigation must be specified in a study. Studies must be placed into a unique subdirectory of the top-level `studies` subdirectory. All ISA metadata specific to a single study MUST be annotated in the file `isa.study.xlsx` at the root of the study's subdirectory. This workbook MUST contain a single resources description that can be organized in one or multiple worksheets.

The `study` file MUST follow the ISA-XLSX study file specification.
Material or experimental samples can be stored in the form of virtual sample files (containing unique identifiers) in the `resources` directory. Each external data file can be interpreted as a virtual sample and stored accordingly under `resources`. External data refers to data that neither originates within the investigation scope of the ARC nor can be referenced externally, but is required to ensure reproducibility.

Protocols that are necessary to describe the sample or material creation process can be placed in the `protocols` directory.

Further explications about data entities defined in the study MAY be stored in ISA-XLSX format in an `isa.datamap.xlsx` file, which SHOULD exist for studies containing data. Further details on `isa.datamap.xlsx` are specified in the ISA-XLSX specification.
All measurement datasets are considered assays and are treated as immutable input data. Assay data MUST be placed into a unique subdirectory of the top-level `assays` subdirectory. All ISA metadata specific to a single assay MUST be annotated in the file `isa.assay.xlsx` at the root of the assay's subdirectory. This workbook MUST contain a single assay that can be organized in one or multiple worksheets.

The `assay` file MUST follow the ISA-XLSX assay file specification.

Further explications about data entities defined in the assay MAY be stored in ISA-XLSX format in an `isa.datamap.xlsx` file, which SHOULD exist for each assay. Further details on `isa.datamap.xlsx` are specified in the ISA-XLSX specification.
Notes:

- There are no requirements on specific assay-level metadata per formal ARC definition. Conversion of ARCs into other repository or archival formats (e.g. PRIDE, GEO, ENA) may, however, mandate the presence of specific terms required in the destination format.
- To ensure reusability of assays, it is strongly RECOMMENDED to include the metadata mandated by typical metadata schemes for reproduction. This process is facilitated by the use of templates that can be found here.
- It is RECOMMENDED to order worksheets according to the input-output relation for readability.
- It is RECOMMENDED to adopt the structure outlined below to organize assay data files and other supporting information.
- An implementation that ensures assay annotation consistent with these requirements is provided by the SWATE tool.
- While assays MAY in principle contain arbitrary data formats, it is highly RECOMMENDED to use community-supported, open formats (see Best Practices).
Workflows in ARCs represent the computational steps used to analyze an ARC's assays and other data transformations that generate run results. Typical examples include data cleaning and preprocessing, computational analysis, and visualization. Workflows are used and combined to generate run results, and allow reuse of processing steps across multiple run results.

Workflow execution and metadata MUST be described using the Common Workflow Language (CWL), v1.2 or higher, in a file `workflow.cwl`, which MUST be placed in the subdirectory containing all files specific to this workflow under the top-level `workflows` subdirectory. This file MUST contain one of the following (a sketch is given after this list):

- A CWL tool description. Tool descriptions must be self-contained and not refer to any files outside the workflow subdirectory. All paths used within the tool description MUST be relative to itself.
- A CWL workflow description. Such descriptions MAY utilize other ARC workflows as nested workflows, but MUST use relative paths in this case. Files outside the ARC root directory MUST NOT be referenced.
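For illustration, a minimal `workflow.cwl` tool description could look like the following sketch. The container image, tool name, and file names (`quay.io/example/normalize:1.0`, `normalize`, `normalized.csv`) are hypothetical placeholders, not part of this specification:

```yaml
cwlVersion: v1.2
class: CommandLineTool
requirements:
  DockerRequirement:
    # Hypothetical container image providing the tool; pin a tag for reproducibility.
    dockerPull: quay.io/example/normalize:1.0
# Hypothetical command-line tool available inside the container.
baseCommand: normalize
inputs:
  measurements:
    type: File
    inputBinding:
      position: 1
outputs:
  normalized:
    type: File
    outputBinding:
      glob: normalized.csv
```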
Notes:

- There are no requirements on the structure or granularity of workflows. An ARC may contain no workflows at all if it contains no run results, or MAY utilize a single workflow to generate a single run result containing all computational output.
- While workflows typically are (and should be) generic, i.e. a single workflow can be applied to different data of the same type, this is not a requirement. It is allowed to hard-code assay file paths and other parameters if workflow reusability is not a priority.
- It is highly recommended that tool descriptions contain a reproducible execution environment description in the form of a Docker container description.
- It is expected that workflow and tool descriptions are authored semi-automatically, e.g. using the arcCommander tool.
- It is strongly encouraged to include author and contributor metadata in tool descriptions and workflow descriptions as CWL metadata.
Runs in an ARC represent all artefacts that result from some computation on the data within the ARC, i.e. assays and external data. These results (e.g. plots, tables, data files, etc.) MUST reside in one or more subdirectories of the top-level `runs` directory.

Each such subdirectory must contain a workflow description `run.cwl`, given in Common Workflow Language (CWL), v1.2 or higher, that describes how the files contained within the run are derived from assay or external data, or other runs. `run.cwl` MUST be placed in the run's subdirectory under the top-level `runs` directory. A parameter file `run.yml` MAY be given to specify run-specific input parameters.

`run.cwl` MAY (and sensibly should) refer to assay data files, external data files, workflow descriptions, and files in other run results; such references MUST use relative paths. Furthermore, `run.cwl` MUST specify all result files as outputs. `run.cwl` MUST be executable without referring to additional payload files or files outside the ARC.
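As an illustrative sketch, a `run.cwl` that applies the hypothetical normalization workflow from the previous section to an assay data file could look as follows; note that all references use relative paths and the result file is declared as an output:

```yaml
cwlVersion: v1.2
class: Workflow
inputs:
  measurement: File
outputs:
  normalized:
    type: File
    outputSource: normalize/normalized
steps:
  normalize:
    # Relative path to the workflow description under the top-level workflows directory
    run: ../../workflows/normalize/workflow.cwl
    in:
      measurements: measurement
    out: [normalized]
```

A corresponding `run.yml` could then bind the run-specific input, again via a relative path (the assay name is a placeholder):

```yaml
measurement:
  class: File
  path: ../../assays/Assay1/dataset/measurement.txt
```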
Notes:

- Run descriptions are intended to ensure that the computational analysis encapsulated within an ARC can be fully reproduced.
- Any files produced by executing the run description which are not specified as CWL outputs in `run.cwl` are considered additional ARC payload. Furthermore, all files of all subdirectories under `runs` that are not referenced from the top-level workflow are considered additional payload.
- It is expected that run descriptions are authored semi-automatically, e.g. using the arcCommander tool.
- It is strongly encouraged to include author and contributor metadata in run descriptions as CWL metadata.
ARCs can include additional payload according to user requirements, e.g. presentations, reading material, or manuscripts. While these files can be placed anywhere in the ARC, it is strongly advised to organize them in additional subdirectories. Especially for the storage of protocols, it is RECOMMENDED to place protocols (assay SOPs) in text form with the corresponding assay in `assays/<assay_name>/protocols/<protocol_name>`.
Note:
- All data missing proper annotation (e.g. studies, assays, workflows or runs) is considered additional payload independent of its location within the ARC.
Top-level metadata and workflow description tie together the elements of an ARC in the context of an investigation, captured in the `isa.investigation.xlsx` file, which MUST be present.

The `investigation` file MUST follow the ISA-XLSX investigation file specification.

Furthermore, top-level reproducibility information SHOULD be provided in the CWL file `arc.cwl`.
The ARC root directory is identifiable by the presence of the `isa.investigation.xlsx` file in XLSX format. It contains top-level information about the investigation and MUST link all assays and studies within an ARC. Study and assay objects are registered and grouped with an investigation to record other metadata within the relevant contexts.
The file `arc.cwl` SHOULD exist at the root directory of each ARC. It describes which runs are executed (and specifically, in which order) to (re)produce the computational outputs contained within the ARC.

`arc.cwl` MUST be a CWL v1.2 workflow description and adhere to the same requirements as run descriptions. In particular, references to study or assay data files and nested workflows MUST use relative paths. An optional file `arc.yml` MAY be provided to specify input parameters.
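For illustration, an `arc.cwl` that chains a single run as a nested workflow might look like the following sketch (the run name and files are hypothetical; a real ARC would wire up all runs in their intended order):

```yaml
cwlVersion: v1.2
class: Workflow
requirements:
  # Required because run.cwl is itself a workflow description
  SubworkflowFeatureRequirement: {}
inputs:
  measurement: File
outputs:
  normalized:
    type: File
    outputSource: run1/normalized
steps:
  run1:
    # Relative path from the ARC root to the run description
    run: runs/run1/run.cwl
    in:
      measurement: measurement
    out: [normalized]
```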
All metadata references to files or directories located inside the ARC MUST follow one of the following patterns:

- The `general pattern`, which is universally applicable and SHOULD be used to specify the path relative to the ARC root.
- The `folder specific pattern`, which MAY be used only in specific metadata contexts:
  - Data nodes in `isa.assay.xlsx` files: the path MAY be specified relative to the `dataset` subfolder of the assay.
  - Data nodes in `isa.study.xlsx` files: the path MAY be specified relative to the `resources` subfolder of the study.
In this example, there are two `assays`, with `Assay1` containing a measurement of a `Source` material, producing an output `Data`. `Assay2` references this `Data` for producing a new `Data`.

Use of `general pattern` relative paths from the ARC root folder:

`assays/Assay1/isa.assay.xlsx`:

| Input [Source Name] | Component [Instrument model] | Output [Data] |
| --- | --- | --- |
| input | Bruker 500 Avance | assays/Assay1/dataset/measurement.txt |

`assays/Assay2/isa.assay.xlsx`:

| Input [Data] | Component [script file] | Output [Data] |
| --- | --- | --- |
| assays/Assay1/dataset/measurement.txt | assays/Assay2/dataset/script.sh | assays/Assay2/dataset/result.txt |
In this example, there are two `assays`, with `Assay1` containing a measurement of a `Source` material, producing an output `Data`. `Assay2` references this `Data` for producing a new `Data`.

Use of `folder specific pattern` relative paths from the `Assay1` and `Assay2` `dataset` folders, respectively:

`assays/Assay1/isa.assay.xlsx`:

| Input [Source Name] | Component [Instrument model] | Output [Data] |
| --- | --- | --- |
| input | Bruker 500 Avance | measurement.txt |

`assays/Assay2/isa.assay.xlsx`:

| Input [Data] | Component [script file] | Output [Data] |
| --- | --- | --- |
| assays/Assay1/dataset/measurement.txt | script.sh | result.txt |
Note that to reference `Data` which is part of `Assay1` from `Assay2`, the `general pattern` is necessary either way. It is therefore considered the more broadly applicable and recommended pattern.
ARCs can be shared in any state. They are considered publishable (e.g. for the purpose of minting a DOI) when fulfilling the following conditions:

- Investigation-level (administrative) metadata contains minimally the following terms:
  - Investigation Identifier
  - Investigation Title
  - Investigation Description
  - INVESTIGATION CONTACTS section and/or Comment[ORCID] of the PI(s):
    - Investigation Person Last Name
    - Investigation Person First Name
    - Investigation Person Mid Initials
    - Investigation Person Email
    - Investigation Person Affiliation
- A publishable ARC MUST NOT be empty: it MUST contain minimally a single assay or a single workflow.
- A publishable ARC MUST be reproducible.
Notes:

- The attribute publishable does not imply that data and metadata contained in an ARC are suitable for publication in a specific outlet (e.g. PRIDE, GEO, EBI), nor that metadata is complete or enables reusability of data. While it may be straightforward to convert the ARC schema into one required by specific publishers or repositories, additional metadata requirements may be enforced during conversion. These are intentionally not captured in this specification.
- As noted above, experimental metadata necessary for publication in a specific outlet is encoded by templates that can be found here.
- Minimal administrative metadata ensures compliance with DataCite for DOI creation.
Reproducibility of ARCs refers mainly to their runs. Within an ARC, it MUST be possible to reproduce the run data. Therefore, all necessary software MUST be available in workflows. In the case of non-deterministic software, the run results should represent typical examples.
ARCs are supposed to be living research objects and are as such never complete. Nevertheless, a mechanism to continuously report the current state and quality of an ARC is indispensable.
The process of assessing quality parameters of an ARC is further referred to as validation of the ARC against a validation package, where the validation package is an arbitrary set of validation cases that the ARC MUST pass to qualify as valid in regard to the validation package.
A validation case is the atomic unit of a validation package describing a single, deterministic and reproducible requirement that the ARC MUST satisfy in order to qualify as valid in regard to it.
The format and scope of these cases naturally vary depending on the type of ARC, the aim of the containing validation package, and the tools used for creating and performing the validation. Therefore, no further requirements are made on the format of validation cases.
Example: The following validation case is defined using natural language:

> All Sample names in this ARC must be prefixed with the string "Sample_"

Any ARC where all sample names are prefixed with the string "Sample_" would be considered valid in regard to this validation case.
A validation package bundles a collection of validation cases that the ARC MUST pass to qualify as valid in regard to the validation package, together with instructions on how to perform the validation and summarize the results.
Validation packages:

- MUST be executable. This can, for example, be achieved by implementing them in a programming language, a shell script, or a workflow language.

- MUST validate an ARC against all contained validation cases upon execution.

- MUST have a globally unique name. This will eventually be enforced by a central validation package registry.

- SHOULD be versioned using semantic versioning.

- MUST be enriched with the following mandatory metadata in an appropriate way (e.g. via YAML frontmatter, tables in a database, etc.):

  | Field | Type | Description |
  | --- | --- | --- |
  | Name | string | the name of the package |
  | Version | string | the version of the package |
  | Summary | string | a single-sentence description (<= 50 words) of the package |
  | Description | string | an unconstrained free-text description of the package |

- MAY be enriched with the following optional metadata in an appropriate way (e.g. via YAML frontmatter, tables in a database, etc.):

  | Field | Type | Description |
  | --- | --- | --- |
  | HookEndpoint | string | a URL to trigger subsequent events based on the result of executing the validation package in a CQC context, see Continuous Quality Control and ARC Apps |

- MAY be enriched with any additional metadata in an appropriate way (e.g. via YAML frontmatter, tables in a database, etc.).

- MUST create a `validation_report.*` file upon execution that summarizes the results of validating the ARC against the cases defined in the validation package. The format of this file SHOULD be an established test result format such as JUnit XML or TAP.

- MUST create a `badge.svg` file upon execution that visually summarizes the results of validating the ARC against the validation cases defined in the validation package. The information displayed SHOULD be derivable from the `validation_report.*` file and MUST include the Name of the validation package.

- MUST create a `validation_summary.json` file upon execution, which contains the mandatory and optional metadata specified above and a high-level summary of the execution of the validation package, following this schema:

  ```json
  {
    "$schema": "http://json-schema.org/draft-04/schema#",
    "type": "object",
    "properties": {
      "Critical": {
        "type": "object",
        "properties": {
          "HasFailures": { "type": "boolean" },
          "Total": { "type": "integer" },
          "Passed": { "type": "integer" },
          "Failed": { "type": "integer" },
          "Errored": { "type": "integer" }
        },
        "required": ["HasFailures", "Total", "Passed", "Failed", "Errored"]
      },
      "NonCritical": {
        "type": "object",
        "properties": {
          "HasFailures": { "type": "boolean" },
          "Total": { "type": "integer" },
          "Passed": { "type": "integer" },
          "Failed": { "type": "integer" },
          "Errored": { "type": "integer" }
        },
        "required": ["HasFailures", "Total", "Passed", "Failed", "Errored"]
      },
      "ValidationPackage": {
        "type": "object",
        "properties": {
          "Name": { "type": "string" },
          "Version": { "type": "string" },
          "Summary": { "type": "string" },
          "Description": { "type": "string" },
          "HookEndpoint": { "type": "string" }
        },
        "required": ["Name", "Version", "Summary", "Description"]
      }
    },
    "required": ["Critical", "NonCritical", "ValidationPackage"]
  }
  ```

- SHOULD aggregate the result files in an appropriately named subdirectory.
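For illustration, a `validation_summary.json` produced by a hypothetical validation package (all names and counts are invented) conforming to the schema above could look like this:

```json
{
  "Critical": {
    "HasFailures": false,
    "Total": 12,
    "Passed": 12,
    "Failed": 0,
    "Errored": 0
  },
  "NonCritical": {
    "HasFailures": true,
    "Total": 5,
    "Passed": 3,
    "Failed": 2,
    "Errored": 0
  },
  "ValidationPackage": {
    "Name": "package1",
    "Version": "1.0.0",
    "Summary": "Checks sample name conventions.",
    "Description": "Validates that all sample names in the ARC follow the project naming scheme."
  }
}
```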
A reference implementation for creating validation cases and validation packages, and for validating ARCs against them, is provided by the arc-validate software suite.
In addition to manually validating ARCs against validation packages, ARCs MAY be continuously validated against validation packages using a continuous integration (CI) system. This process is further referred to as Continuous Quality Control (CQC) of the ARC. CQC can be triggered by any event that is supported by the CI system, e.g. a push to a branch of the ARC repository or a pull request.
To make sure that validation results are bundled with ARCs but do not pollute their commit history, validation results MUST be stored in a separate branch of the ARC repository. This branch:

- MUST be named `cqc`
- MUST be an orphan branch
- MUST NOT be merged into any other branch
- MUST contain the following folder structure `{$branch}/{$package}`:

  ```
  cqc branch root
  └── {$branch}
      └── {$package}
  ```

  where:

  - `{$branch}` is the name of the branch the validation was run on
  - `{$package}` is the name of the validation package the validation was run against. This folder MUST contain the files `validation_report.*` and `badge.svg` as described in the validation package specification. The folder MAY also be suffixed with the version of the validation package via an `@` character followed by the version number of the validation package: `{$package}@{$version}`, e.g. `package1@1.0.0`.
Example: This example shows the validation results of the `main` and `branch-1` branches of the ARC repository against the `package1` and `package2` validation packages. For `package2`, an optional version hint of the package is included in the folder name:

```
cqc-branch-root
├── branch-1
│   ├── package1
│   │   ├── badge.svg
│   │   └── validation_report.xml
│   └── package2@2.0.0
│       ├── badge.svg
│       └── validation_report.xml
└── main
    ├── package1
    │   ├── badge.svg
    │   └── validation_report.xml
    └── package2@2.0.0
        ├── badge.svg
        └── validation_report.xml
```
Commits to the `cqc` branch MUST contain the commit hash of the commit that was validated in the commit message.
The `validation_packages.yml` file specifies the validation packages that the branch containing the file will be validated against.

Each branch of an ARC MAY contain 0 or 1 `validation_packages.yml` files.
If the file is present, it:

- MAY contain an `arc_specification` key which, when present, MUST contain the version of the ARC specification that the ARC should be validated against. Schema specification should be tied to specification releases and be directly integrated into tools that can perform validation against validation packages.

- MUST be located in the `.arc` folder in the root of the ARC.

- MUST contain the `validation_packages` key, which is a list of validation packages that the current branch will be validated against.

  Values of the `validation_packages` list are objects with the following fields:

  - `name`: the name of the validation package. This field is mandatory and MUST be included for each validation package object. This name MUST be unique across all validation package objects, which means that only one version of a package can be contained in the file.
  - `version`: the version of the validation package. This field is optional and MAY be included for each validation package object. If included, it MUST be a valid semantic version, restricted to MAJOR.MINOR.PATCH format. If not included, the latest available version of the validation package will be used.
Example: This example shows a `validation_packages.yml` file that specifies that the current branch will be validated against version `2.0.0-draft` of the ARC specification, version `1.0.0` of `package1`, version `2.0.0` of `package2`, and the latest available version of `package3`:

```yaml
arc_specification: 2.0.0-draft
validation_packages:
  - name: package1
    version: 1.0.0
  - name: package2
    version: 2.0.0
  - name: package3
```
Continuous Quality Control makes it possible to check at any point in the ARC life cycle whether the ARC passes certain criteria.

However, whether an ARC is valid for a given target is only half of the equation; the other half is taking some kind of action based on this information. One large field of action here is the publication of the ARC or (some of) its contents to an endpoint repository (ER) (e.g. PRIDE, ENA).

In this scenario, a validation package SHOULD only determine whether the content COULD be published to the ER, and a subsequent service SHOULD then take the respective action based on the reported result of that package (e.g. fixing errors based on the report, or publishing the content to the ER).

ARC apps are services that provide URLs called (CQC) Hook Endpoints that can be triggered manually or by the result of a validation package. They are intended to automate the process of taking action based on the result of a validation package.
PLANTDataHUB performs Continuous Quality Control of ARCs using the arc-validate software suite as described in our 2023 paper PLANTdataHUB: a collaborative platform for continuous FAIR data sharing in plant research.
The following sequence diagram shows the conceptual implementation of CQC pipelines in conjunction with ARC Apps connected via CQC Hooks on the reference DataHUB instance with the following participants:
- User: The user who works on an ARC published on the DataHUB
- ARC: The ARC repository on the DataHUB
- DataHUB: The DataHUB instance
- ARC App: A service that provides a CQC Hook Endpoint to perform actions based on validation results and/or user input
```mermaid
sequenceDiagram
    participant User
    participant ARC
    participant DataHUB
    participant ARC App
    Note over User, DataHUB: Validation (CQC pipeline)
    User ->> ARC : commit
    DataHUB ->> DataHUB : trigger validation for commit
    DataHUB ->> ARC : commit validation results <br> to cqc branch
    DataHUB ->> ARC : create badge
    Note over User, ARC App: CQC Hooks
    User ->> ARC App : click on badge link
    DataHUB ->> ARC App : trigger some action based on validation results
    ARC App ->> DataHUB : Request relevant information
    DataHUB ->> ARC App : send relevant information (when granted access)
    ARC App ->> ARC App : Perform action with retrieved data
```
In the following section, we provide best practices to make the use of an ARC even more efficient and valuable for open science.

It is recommended to use community-specific data formats covering the most common measurement techniques. Using the following recommended formats will ensure improved accessibility and findability:
- mzML (raw data metabolomics and proteomics)
- mzTAB (analysis data metabolomics and proteomics)
- fastq.gz (compressed NGS Short Read Sequencing, Long Read Sequencing)
- fastq (NGS Short Read Sequencing, Long Read Sequencing)
- SAM (Sequence Alignment/Map format)
- BAM (Compressed binary version of a SAM file that is used to represent aligned sequences)
Notes:
- In case of storing vendor-specific data within an ARC, it is strongly encouraged to accompany it with the corresponding open formats, or to provide a workflow for conversion or processing where this is possible and considered standard.
Compression is preferable to save disk space and speed up data transfers, but is not required. Without compression, workflows are simpler, as transparent compression and decompression is often not available. Uncompressed files are also usually easier to index and better searchable.
Encryption is not advised (but could be an option to share sensitive data in an otherwise open ARC).
Required files defined in the ARC structure need to be named accordingly. Files and folders specified in angle brackets (`< >`) can be named freely. As the ARC might be used by different persons and in different workflow contexts, we recommend concise filenames without blanks and special characters. Filenames SHOULD therefore stick to lowercase and uppercase letters without umlauts, accented, and similar special characters; numbers, hyphens, and underscores are suitable as well. Modern working environments can handle blanks in filenames, but they might confuse automatically run scripts and thus SHOULD be avoided. Depending on the number of people the ARC is shared with, certain information might prove useful to provide a fast overview in human-readable form in the filename, e.g. abbreviations of the project or sub-project, or of the person creating or working on a particular dataset. Date and time information might be encoded as well if it provides better ordering or information for the particular purpose.
Research Object (RO) Crate is a lightweight approach, based on schema.org, to package research data together with their metadata.

An ARC can be augmented into an RO Crate by placing a metadata file `ro-crate-metadata.json` into the top-level ARC folder, which must conform to the RO Crate specification.

The ARC root folder is then simultaneously the RO Crate Root and represents an ISA investigation.

The studies, assays, and workflows are part of the investigation and linked to it using the typical RO-Crate methodology, e.g. the `hasPart` property of http://schema.org/Dataset.
All four object types follow their corresponding profiles.
It is RECOMMENDED to adhere to the following conventions when creating this file (a sketch is given after this list):

- The root data entity follows the ISA Investigation profile.
  - The root data entity description is taken from the "Investigation Description" term in `isa.investigation.xlsx`.
  - The root data entity authors are taken from the "Investigation Contacts" in `isa.investigation.xlsx`.
  - The root data entity citations are taken from the "Investigation Publications" section in `isa.investigation.xlsx`.
- For each assay and study linked from `isa.investigation.xlsx`, one dataset entity is provided in `ro-crate-metadata.json`. The dataset id corresponds to the relative path of the assay ISA file under `assays/`, e.g. "sample-data/isa.assay.xlsx". Other metadata is taken from the corresponding terms in the corresponding `isa.assay.xlsx` or `isa.study.xlsx`.
- The root data entity is connected to each assay and study through the `hasPart` property.
- The assay and study entities follow the ISA Assay profile or the ISA Study profile, respectively.
It is expected that future versions of this specification will provide additional guidance on a comprehensive conversion of ARC metadata into RO-Crate metadata.