Rearrange processes and modules, add SV workflow for Delly2 #8
Conversation
I'll also note that R package management is a nightmare for reproducibility. I finally hit on using renv, which allowed me to version-pin all of the packages. Bioconductor came close, but it doesn't include all packages and allows for "bugfix" updates within a larger release.
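For illustration, a minimal renv workflow looks like this (these are standard renv calls; the specific package and version are hypothetical examples, not the pipeline's actual dependencies):

```r
# Initialize renv in the project; creates renv.lock and a project-private library
renv::init()

# Install a specific pinned version into the project library
# (package/version chosen purely for illustration)
renv::install("ranger@0.16.0")

# Record the exact versions of all packages currently in use into renv.lock
renv::snapshot()

# On another machine (or inside a Docker build), reinstall exactly
# the versions recorded in renv.lock
renv::restore()
```

Checking renv.lock into the repository is what makes the pinning reproducible across machines and container builds.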
NFTest output (one of the tests failed).
Thanks to @nkwang24, the SV test is now passing.
I'll look over this later today or early tomorrow!
A general note: we'll want to include the main tool-level directory at the end here.
Also added a couple of suggestions for process names added in this PR
repeat_bed = "/hot/ref/database/RepeatMasker-3.0.1/processed/GRCh38/GRCh38_RepeatMasker_intervals.bed"

// SV files
// FIXME Should this be bundled?
clarification: Is this a question of whether the files should be bundled into the Docker?
Either into the Docker image, with the pipeline, or to a given reference path on disk. Put another way, is a user expected to (1) provide this file for each pipeline run, (2) have a standard copy locally, or (3) have it automatically provided for them by the pipeline?
I'm thinking we can divide the various input files into the 3 categories you listed:
(1) RF models (6 tools × 2 conversion directions = 12 total, ~10 MB to 1 GB each) hosted separately for the user to download
(2) Expect user to have standard resource files such as reference fastas, chain files, funcotator sources
(3) Bundle the non-standard resource files (repeat_bed, header_contigs, gnomad_rds) into the Docker
Is this what you had in mind?
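A sketch of how that three-way split might surface in the pipeline config (the param names and paths here are hypothetical, not the pipeline's actual interface):

```groovy
params {
    // (1) Hosted separately: user downloads the RF model for their tool/direction
    rf_model           = "/path/to/downloads/RF_model_Delly2_GRCh37-to-GRCh38.rds"

    // (2) Standard resources the user is expected to already have locally
    ref_fasta          = "/path/to/GRCh38.fa"
    chain_file         = "/path/to/hg19ToHg38.over.chain"
    funcotator_sources = "/path/to/funcotator_dataSources"

    // (3) Non-standard resources bundled with the pipeline itself,
    //     resolved relative to the repository rather than user-supplied
    repeat_bed         = "${projectDir}/resources/GRCh38_RepeatMasker_intervals.bed"
    header_contigs     = "${projectDir}/resources/header_contigs.txt"
    gnomad_rds         = "${projectDir}/resources/gnomad.rds"
}
```

Under option (3), the bundled files could either live in the Docker image or ship with the repository; either way the user never sets those params by hand.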
Cool - so I think that works out as:
- We upload RF models as attachments on pipeline releases.
- Users handle standard resource files.
- We bundle the non-standard resource files with the pipeline (the repeat_bed file is used outside of the Docker image). That means they get checked into this repository and version-controlled.
For the non-standard resource files, it may be better to include them as release attachments rather than version-controlling them.
@yashpatel6 so you assert that there should be two categories of files?
- User-provided standard files
- Everything else, distributed as release attachments
That's what I would suggest, yes. The concern I'd have about bundling the non-standard files into the Docker image is the case where a user wants to make changes or provide a different file. Having the user provide those paths in the config, like other resources, seems more consistent and allows for that behavior.
Generally looks good! One remaining comment for the output directory:
pipeline-StableLift/config/methods.config, line 12 in eb8983a:
params.output_dir_base = "${params.output_dir}/${manifest.name}-${manifest.version}/${params.sample_id.replace(' ', '_')}"
To include StableLift-&lt;manifest.version&gt; at the end:
params.output_dir_base = "${params.output_dir}/${manifest.name}-${manifest.version}/${params.sample_id.replace(' ', '_')}/StableLift-<manifest.version>"
It seems a bit redundant in this case, since the main tool is the pipeline itself, but this just brings it in line with the output structure the rest of our pipelines follow.
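For a concrete (hypothetical) illustration of how that template would resolve, assuming manifest.name = "pipeline-StableLift", manifest.version = "1.0.0", and params.sample_id = "A-mini 01":

```groovy
// Template:
//   ${params.output_dir}/${manifest.name}-${manifest.version}/${params.sample_id.replace(' ', '_')}/StableLift-${manifest.version}
// Resolves to something like:
//   /path/to/output/pipeline-StableLift-1.0.0/A-mini_01/StableLift-1.0.0
```

Note the space in the sample ID is replaced with an underscore, and the tool-level directory repeats the version so each run's outputs are self-describing.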
Okay @yashpatel6, I've added:
Great! Looks good!
Description
This addresses 99% of #7, although I still have one lingering issue in that the final VCF fails indexing. Closes #7.
I've rearranged all of the processes to be more modular - there are now three high-level blocks: validation (common), feature extraction (SNV or SV), and stability prediction (common). There are now two NFTest cases (one each for the SNV and SV branches), but they remain smoke tests without any assertions.
I've also added a pipeline diagram. I waffled back-and-forth on this but ultimately ended up using Mermaid rather than PlantUML so that I could use those "Parameterized Input" bubbles. GitHub's UI renders Mermaid code blocks, but for the README I manually rendered an SVG. I'm intending to extend our PlantUML-rendering action to do the same thing.
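As a rough sketch (not the actual diagram in the README), the three-block structure described above could be expressed in a Mermaid code block like this:

```mermaid
flowchart TD
    A[Input VCF] --> B[Validation - common]
    B --> C{Variant type}
    C -->|SNV| D[Feature extraction - SNV]
    C -->|SV| E[Feature extraction - SV]
    D --> F[Stability prediction - common]
    E --> F
    F --> G[Lifted VCF with stability scores]
```

GitHub renders a block like this directly in markdown, which is the advantage mentioned over PlantUML for this repository.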
Testing Results
Checklist
- I have read the code review guidelines and the code review best practice on GitHub check-list.
- I have reviewed the Nextflow pipeline standards.
- The name of the branch is meaningful and well formatted following the standards, using [AD_username (or 5 letters of AD if AD is too long)]-[brief_description_of_branch].
- I have set up or verified the branch protection rule following the GitHub standards before opening this pull request.
- I have added my name to the contributors listings in the manifest block in the nextflow.config as part of this pull request, am listed already, or do not wish to be listed. (This acknowledgement is optional.)
- I have added the changes included in this pull request to the CHANGELOG.md under the next release version or unreleased, and updated the date.
- I have updated the version number in the metadata.yaml and manifest block of the nextflow.config file following semver, or the version number has already been updated. (Leave it unchecked if you are unsure about the new version number and discuss it with the infrastructure team in this PR.)
- I have tested the pipeline on at least one A-mini sample.