Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rework/Add CWL Part #111

Merged
merged 14 commits into from
Sep 10, 2024
Merged

Rework/Add CWL Part #111

merged 14 commits into from
Sep 10, 2024

Conversation

caroott
Copy link
Member

@caroott caroott commented Jul 3, 2024

This PR adds specifications for the metadata associated with workflows and runs. It also reworks the existing CWL part. #110

ARC specification.md Show resolved Hide resolved
ARC specification.md Show resolved Hide resolved
Copy link
Member Author

@caroott caroott left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As per this discussion, the root arc.cwl was replaced by individual run.cwl files

@kMutagene
Copy link
Member

@caroott what is missing to get this merged?

@caroott caroott marked this pull request as ready for review September 2, 2024 11:24
@caroott caroott requested a review from kMutagene September 2, 2024 11:24
@HLWeil
Copy link
Member

HLWeil commented Sep 2, 2024

Please, as @kMutagene already suggested, stick to the MUST/SHOULD/MAY specification conventions used throughout other sections of the specification.

@caroott
Copy link
Member Author

caroott commented Sep 2, 2024

Are you referring to the capitalization of MUST/SHOULD/MAY or the usage? Because the part @kMutagene highlighted is a part of the specification that has not been changed by this PR

@HLWeil
Copy link
Member

HLWeil commented Sep 2, 2024

Both new and existing parts of text. Maybe you could go through the sections you edited?

ARC specification.md Outdated Show resolved Hide resolved
Copy link
Member

@HLWeil HLWeil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've just read through it again and it already looks pretty good. But I would still suggest to fix a few points (which I commented) before merging.

Especially one thing I think needs some clarification. If I understoof it correctly, the Run Metadata and Workflow Metadata section refer to the additional metadata in yaml format that can extend all file defined in the cwl specification, right?
This does not really come through here. I would wish for an explicit clafirication which states this.

@@ -26,7 +26,7 @@ Licensed under the Creative Commons License CC BY, Version 4.0; you may not use
- [Additional Payload](#additional-payload)
- [Top-level Metadata and Workflow Description](#top-level-metadata-and-workflow-description)
- [Investigation and Study Metadata](#investigation-and-study-metadata)
- [Top-Level Run Description](#top-level-run-description)
- [Individual Run Description](#individual-run-description)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does not fit under "Top-level Metadata" anymore if it's about run-specific metadata.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not the individual workflow description, but the description which workflow was executed with which job file. It also replaces the arc.cwl, which intended to do this for the whole arc. Since it is then the highest level description of the run, it would still be top level in my opinion

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd say the top-level does not refer to top-level per metadata kind, but rather top-level on the ARC in general.


## Run Description

**Runs** in an ARC represent all artefacts that result from some computation on the data within the ARC, i.e. [assays](#assay-data-and-metadata) and [external data](#external-data). These results (e.g. plots, tables, data files, etc. ) MUST reside inside one or more subdirectory of the top-level `runs` directory.

Each such subdirectory must contain a workflow description `run.cwl`, given in [Common Workflow Language](https://www.commonwl.org/) (CWL), [v1.2](https://www.commonwl.org/v1.2/) or higher, that describes how the files contained with the run are derived from assay or external data, or other runs. `run.cwl` MUST be placed in the subdirectory under the top-level `runs` directory. A parameter file `run.yml` MAY be given to specify run-specific input parameters.
Each such subdirectory MUST contain a workflow description `run.cwl`, given in [Common Workflow Language](https://www.commonwl.org/) (CWL), [v1.2](https://www.commonwl.org/v1.2/) or higher, that describes how the files contained with the run are derived from assay or external data, or other runs. `run.cwl` MUST be placed in the subdirectory under the top-level `runs` directory. A parameter file `run.yml` MAY be given to specify run-specific input parameters.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there an article, (e.g. the) needed before the run.cwl?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for me, run.cwl stands as a replacement for the actual file in the specs and i would leave it without an article. The run CWL file would be the other case, where i would use it. But I'm fine with either one


## Run Description

**Runs** in an ARC represent all artefacts that result from some computation on the data within the ARC, i.e. [assays](#assay-data-and-metadata) and [external data](#external-data). These results (e.g. plots, tables, data files, etc. ) MUST reside inside one or more subdirectory of the top-level `runs` directory.

Each such subdirectory must contain a workflow description `run.cwl`, given in [Common Workflow Language](https://www.commonwl.org/) (CWL), [v1.2](https://www.commonwl.org/v1.2/) or higher, that describes how the files contained with the run are derived from assay or external data, or other runs. `run.cwl` MUST be placed in the subdirectory under the top-level `runs` directory. A parameter file `run.yml` MAY be given to specify run-specific input parameters.
Each such subdirectory MUST contain a workflow description `run.cwl`, given in [Common Workflow Language](https://www.commonwl.org/) (CWL), [v1.2](https://www.commonwl.org/v1.2/) or higher, that describes how the files contained with the run are derived from assay or external data, or other runs. `run.cwl` MUST be placed in the subdirectory under the top-level `runs` directory. A parameter file `run.yml` MAY be given to specify run-specific input parameters.

`run.cwl` MAY (and sensibly, should) refer to assay data files, external data files, workflow descriptions, and files in other run results; such references MUST use relative paths. Furthermore, `run.cwl` MUST specify as outputs all result files. `run.cwl` MUST BE executable without referring to [additional payload files](#additional-auxiliary-payload) or files outside the ARC.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add clarifications for the relative paths?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have an example for relative paths in General Patterns, but i think it would make sense here, since the paths are not relative to the arc root, but rather relative to the job file

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah maybe just add a helping sentence or use some phrasing like relative to the cwl file

- It is strongly encouraged to include author and contributor metadata in run descriptions as [CWL metadata](https://www.commonwl.org/user_guide/17-metadata/index.html).
### Run Metadata

- For metadata annotation, namespaces and schemas SHOULD be referenced, as shown in the [CWL metadata user guide](https://www.commonwl.org/user_guide/topics/metadata-and-authorship.html)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

. missing at the end

@@ -218,7 +231,20 @@ Notes:

- It is expected that run descriptions are authored semi-automatically, e.g. using the [arcCommander](https://github.com/nfdi4plants/arcCommander) tool.

- It is strongly encouraged to include author and contributor metadata in run descriptions as [CWL metadata](https://www.commonwl.org/user_guide/17-metadata/index.html).
### Run Metadata
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add a link to the profile here? Especially the last bullet point

This is mainly done using the processSequence (currently about).

seems off without some more context

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, this would help

@@ -244,11 +270,11 @@ The ARC root directory is identifiable by the presence of the `isa.investigation
Multiple studies MUST be stored using one worksheet per study in `isa.studies.xlsx` in the root directory of the ARC.
The study-level SHOULD define [ISA factors](https://isa-specs.readthedocs.io/en/latest/isamodel.html#study) of a study and MAY contain overlapping information also to be found in all assays grouped by the study. -->

### Top-Level Run Description
### Individual Run Description
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, this section does not fit into the supersection anymore.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As per my comment above, we could return the name to Top-Level i think. If not i would agree that it should be moved

@kMutagene kMutagene changed the base branch from main to dev September 5, 2024 14:31
@caroott caroott requested a review from HLWeil September 9, 2024 14:41
Copy link
Member

@HLWeil HLWeil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just one change, otherwise lgtm 👍


- This is mainly done using the processSequence (which currently maps to the [about](https://schema.org/about) type of LabProcess, see [here](https://github.com/nfdi4plants/isa-ro-crate-profile/blob/release/profile/isa_ro_crate_mapping.md)).

### Individual Run Description
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would cut out this section, as it is just duplicated information now.

Copy link
Member

@HLWeil HLWeil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, @kMutagene?


- Add metadata annotation as shown in the [CWL metadata user guide](https://www.commonwl.org/user_guide/topics/metadata-and-authorship.html).

- Namespaces and schemas SHOULD be referenced (e.g. [Lab Protocol](https://github.com/nfdi4plants/isa-ro-crate-profile/blob/main/profile/isa_ro_crate.md#labprotocol).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing closing parenthesis


- Metadata relevant to the tool description or workflow description SHOULD be added. This metadata MUST be limited to only metadata that directly describes the processing unit. Metadata describing the run parameters MUST be added to the `run.yml` parameter file.

- The properties of [Lab Protocol](https://github.com/nfdi4plants/isa-ro-crate-profile/blob/main/profile/isa_ro_crate.md#labprotocol) and [Computational Workflow](https://bioschemas.org/profiles/ComputationalWorkflow/1.0-RELEASE#nav-description)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TBH i do not understand how this relates to the ROCrate/schema.org types you mention here, as the way to annotate metadata in cwl files (mentioned in the first bullet point in this section) links to a completely different format than that. Does this mean that the property names of those types should be used in the CWL metadata format?

Copy link
Member

@kMutagene kMutagene Sep 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also where possible, stick to one line per sentence in markdown files to improve reviewability

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that is what it means. You can reference the namespaces and use them in the cwl format to annotate your data

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can reference the namespaces and use them in the cwl format to annotate your data

Then i'd suggest to include that explanation here. It might be findable in the CWL metadata specification, but this is so integral that it should be explicitly explained here IMO

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i added a bullet point which hopefully clarifies it

@kMutagene kMutagene merged commit 7bbc1e0 into nfdi4plants:dev Sep 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add/improve explicit CWL-related section(s)
3 participants