Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add a rule to ensure path directives are S3 URIs #44

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

jaamarks
Copy link

Fixes #43

Description of Changes
Added a validation rule to the HealthOmics package to enforce s3:// paths for all input and output file paths specified via the path directive.

Description of how you validated changes

Validation was performed by adding two test cases: one with valid s3:// paths, which passed, and another with a local file path, which contains one violation as expected, demonstrating the rule's enforcement.


Checklist

  • If I have added a new rule, that rule has also been added to healthomics-nextflow-rules/src/main/resources/rulesets/healthomics.xml
  • If this change would make any existing documentation invalid, I have included those updates within this PR
  • I have added unit tests that prove my fix is effective or that my feature works
  • I have confirmed that I am able to locally build the project with no errors (./gradlew clean build)
  • I have not made extensive or unnecessary formatting changes unless this PR is specifically and only about reformatting
  • Title of this Pull Request follows Conventional Commits standard: https://www.conventionalcommits.org/en/v1.0.0/

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license

- Add new rule to healthomics package
- Add new rule to resources xml: `healthomics-nextflow-rules/src/main/resources/rulesets/healthomics.xml`
- Add test cases
@jaamarks jaamarks marked this pull request as ready for review March 14, 2025 19:59
@jaamarks jaamarks requested a review from markjschreiber March 17, 2025 21:42
Copy link
Contributor

@markjschreiber markjschreiber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a great start but you might want to start with the use cases that are simplest to check and build up to the more complicated cases. Unfortunately Nextflow has many valid ways to declare a path and only some are simple to check if they are S3 or local paths.

path par.infile
}
'''
assertSingleViolation(SOURCE, 4, "path par.infile", "The file path must be either a string or params.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should not be a violation. A path can be a variable and not just one from the params map. It could also be an expression https://www.nextflow.io/docs/latest/process.html#input-files-path

final SOURCE = '''
process bar {
input:
path 's3://my-input-bucket/data/input1.txt', 's3://my-input-bucket/data/input2.txt'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is correct but there are ways in which Path can be a list of files. https://www.nextflow.io/docs/latest/process.html#multiple-input-files

@jaamarks
Copy link
Author

It's a great start but you might want to start with the use cases that are simplest to check and build up to the more complicated cases. Unfortunately Nextflow has many valid ways to declare a path and only some are simple to check if they are S3 or local paths.

Yeah @markjschreiber , I agree that building up from some simple cases will be a good approach. In thinking about this, I believe that the simplest cases are

  • path 's3://my-input-bucket/data/foo.txt' (checking for direct s3://.* pattern)
  • path '/home/data/foo.txt' (checking for direct local pattern)

And then from there I am a little unclear how best to proceed. Because as you pointed out, we could also have substitutions like:

  • path params.foo
  • path other.foo
  • path '${foo}/${bar}'

But my understanding is that codenarc is a static analysis tool, thus we cannot evaluate the parameter values or variables during runtime of linter. Given this, what do you suggest for handling these cases?


And regarding you description in #44:

Check for things like http, https, ftp.

  • Does this mean we would like to allow arguments that match these patterns when supplied to the path directive, in addition to s3://.* patterns?

@markjschreiber
Copy link
Contributor

But my understanding is that codenarc is a static analysis tool, thus we cannot evaluate the parameter values or variables during runtime of linter. Given this, what do you suggest for handling these cases?

That is correct. In some cases you can infer simple non-static things like string concatenation but this means you have to keep lookup tables and have some processing so not simple. I think it's fine to keep your rule focussed on the simple cases, you just don't want it to trigger on something like path foo where foo is a parameter rather than a String like path './foo'

Does this mean we would like to allow arguments that match these patterns when supplied to the path directive, in addition to s3://.* patterns?

No, those are things that we also want to exclude and would also be good test cases for triggering the rule.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add check for input paths that are not S3 URIs
2 participants