Discussion: run.cwl in run folders #112

Closed
caroott opened this issue Jul 4, 2024 · 9 comments

@caroott
Member

caroott commented Jul 4, 2024

The ARC specification states under Workflow description that tools and workflows used during computational analysis must be described in the workflows folder as .cwl files. The run description states that each run needs a corresponding run.cwl that describes how that exact run result is composed.

Due to the nature of CWL, this run.cwl may be unnecessary overhead or could be simplified. All necessary information about the run execution can be derived from the combination of the executed .cwl file and the run.yml. The run.yml is already located in the corresponding runs folder, so the only missing piece is which CWL file was executed. If one were to create the run.cwl as stated in the specification, I have two possibilities in mind:

1. Wrap the executed tool or workflow in another workflow:

The disadvantage is that this is quite a large overhead. All required inputs must be specified again in the wrapping workflow and mapped to the inputs of the tool/workflow. The outputs must then be collected as usual in a workflow.

In the worst case, the run.cwl file is almost like a copy of the referenced workflow.cwl.

2. Create a CWL tool description that executes the CWL runner with the given .cwl and .yml files:

Example:

cwlVersion: v1.2
class: CommandLineTool
baseCommand: [cwltool, ../../workflows/MyWorkflow/workflow.cwl, run.yml]
outputs:
  myOutput:
    type: Directory
    outputBinding:
      # this returns the whole working directory
      glob: $(runtime.outdir)

This way, it is just the executing command wrapped in a CommandLineTool CWL. It returns the entire output directory, so as long as the executed workflow is well described, it should return everything as intended. This could become difficult only if expression tools are used at the end of a workflow to sort files. The overhead is small, and the file contains all required information.
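
For comparison, option 1 (wrapping the executed workflow in another workflow) could look roughly like the sketch below. The input names `inputFile` and `threshold` and the output name `results` are hypothetical; in a real run.cwl they would have to mirror whatever the referenced workflow.cwl declares, which is exactly the duplication described above.

```yaml
cwlVersion: v1.2
class: Workflow

# Needed because the step below runs another Workflow, not a single tool
requirements:
  SubworkflowFeatureRequirement: {}

# Hypothetical inputs: every input of the wrapped workflow is declared again
inputs:
  inputFile: File
  threshold: int

steps:
  analysis:
    run: ../../workflows/MyWorkflow/workflow.cwl
    # Map each wrapper input onto the wrapped workflow's input of the same name
    in:
      inputFile: inputFile
      threshold: threshold
    # Hypothetical output name of the wrapped workflow
    out: [results]

# Outputs are collected again from the wrapped step
outputs:
  results:
    type: Directory
    outputSource: analysis/results
```

Apart from the `run:` reference, everything in this file restates what workflow.cwl already declares.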

Since the only information we require is which workflow/tool is executed, can we maybe find a better way to represent that information? Or do we want to stick with the run.cwl and recommend the example I posted? Or do we want to recommend wrapping everything in one workflow again?


Edit: links, format, small adjustments

@github-actions github-actions bot added the Status: Needs Triage This item is up for investigation. label Jul 4, 2024
@caroott caroott moved this to In discussion in ARCStack Jul 4, 2024
@HLWeil HLWeil removed the Status: Needs Triage This item is up for investigation. label Jul 4, 2024
@HLWeil
Member

HLWeil commented Jul 4, 2024

Reading again through this: This is a question specific to when a run executes a workflow, right? When the run is self-contained, the run.cwl is too?

@caroott
Member Author

caroott commented Jul 4, 2024

That depends. I interpreted the ARC specification to mean that every computational step should be described as either a tool or a workflow description and saved in the workflows folder. That wouldn't allow for self-contained runs, unless the run requires no computational steps.

@muehlhaus
Member

This was simply due to a mistake: run.cwl is meant to be run.yml. The idea is that under workflows you find the more reusable part, while a run is defined by its specific run parameters, especially the concrete inputs and outputs!
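
As an illustration of that split, a run.yml job file might look like the following sketch. The input names and the data path are hypothetical and would have to match the inputs declared by the referenced workflow.cwl.

```yaml
# run.yml: concrete parameter values for one specific run (hypothetical names)
inputFile:
  class: File
  # Hypothetical path to a dataset elsewhere in the ARC
  path: ../../assays/MyAssay/dataset/data.csv
threshold: 5
```

The workflow description stays reusable, while this file pins down the concrete inputs of a single run. It does not record which .cwl file it is meant to be executed with.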

@caroott
Member Author

caroott commented Jul 9, 2024

To add to this issue, after a discussion we had:
We have no way of telling how a run is intended to be executed unless it is executed and a run report is generated in some way. So we need a way to declare the intention: which combination of cwl and yml file should be executed for the specific run.

Originally, there was the arc.cwl in the root, which should execute the whole ARC upon running. As I understood it, this was dropped for ease of use and to not overcomplicate things. It would be one possibility to get the connection between workflow/tool file and job file for a run. The other possibility would be the example I posted above:

cwlVersion: v1.2
class: CommandLineTool
baseCommand: [cwltool, ../../workflows/MyWorkflow/workflow.cwl, run.yml]
outputs:
  myOutput:
    type: Directory
    outputBinding:
      # this returns the myDir directory inside the output directory
      glob: $(runtime.outdir)/myDir

One of those two possibilities, or a third one that handles it, should be implemented to get that connection info. It would also be useful to get input from other people working with ARCs on what they prefer for ease of use. What do you think about this issue, @Brilator and @floWetzels?

@Brilator
Member

Brilator commented Jul 9, 2024

Do I understand the question correctly: how do we document which "run.yml" + "workflow.cwl" combination yields which output?
The way I currently do it is similar to the above: having a readme in the respective runs folder with something like
cwltool ../../workflows/MyWorkflow/workflow.cwl run.yml.

Plus I was planning to collect the overall ARC analysis / workflows with one arc.cwl in the root (currently more for visualization of the ins and outs).

@caroott
Member Author

caroott commented Jul 9, 2024

Yes, that's the question here. The arc.cwl in the root you mention would be the first case, with an arc.cwl that executes the whole run. A readme in the runs folder also solves the question, at least for the user reading the ARC. The problem there would be how to ensure that it follows a specific format and is machine readable, so we can include it in the ARC datamodel.

@Brilator
Member

Brilator commented Jul 9, 2024

Yes, I meant to confirm that my non-machine-readable solution was aiming in the same direction.

Not sure about your outputs being bound to a directory. Or is this just one example, and one would have to adapt it for other workflows?

@caroott
Member Author

caroott commented Jul 9, 2024

This output would vary between runs. Each run.cwl would contain the directory in which that run is stored.

@caroott
Member Author

caroott commented Jul 16, 2024

For now, I would add Version 2 to the ARC specification. This gives us a way to accurately identify both the intended run execution and the run execution itself. If a better solution comes up in the future, this could be subject to change again.

@caroott caroott closed this as completed Jul 16, 2024
@github-project-automation github-project-automation bot moved this from In discussion to Done in ARCStack Jul 16, 2024