Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

organize for metapipeline #23

Merged
merged 20 commits into from
Dec 1, 2023
Merged

Conversation

alkaZeltser
Copy link
Collaborator

@alkaZeltser alkaZeltser commented Nov 8, 2023

Description

Updates to turn this pipeline into a useful component of metapipline.

  • Added a couple new processes to handle merging of target and off-target intervals
  • Rearranged workflow logic to more intuitively follow pipeline default functions
  • Added parameterization of many optional outputs
  • Updated README (A LOT)
  • Created a workflow diagram

Closes #15

Urgent things to address before official release:

  • Organize tool-specific reference files in appropriate /hot/ref/ location.
    • See Notes on off-target coverage calculation filtered by dbSNP loci #18 for notes on dbSNP reference file formatting. /hot/user/nzeltser/pipeline-targeted-coverage/test/input/dbSNP-155_thinned_hg38.vcf.gz
    • Additional non-standard ref files: genome sizes required by bedtools slop and provided with the bedtools distribution: /hot/user/nzeltser/tools/bedtools2/genomes/human.hg38.genome
  • Confirm standardized output directory naming. Currently all outputs fall under SAMtools-1.16.1 as this is the tool behind the main read depth calculation process - do we want to keep this?
  • Should the targeted sequencing dataset be converted into an official test set?
  • Do I need to implement checksum creation? Does not currently exist.

Testing Results

Testing set is the BZPRGPT2 PRAD targeted panel pilot cohort 2 with original target file.

  • Case 1: Just CollectHsMetrics
    • output: /hot/software/pipeline/pipeline-targeted-coverage/Nextflow/development/unreleased/nzeltser-organize-for-metapipeline/hs-metrics-only
  • Case 2: run all processes that produce a final output; no intermediate files
    • output: /hot/software/pipeline/pipeline-targeted-coverage/Nextflow/development/unreleased/nzeltser-organize-for-metapipeline/all-final-outputs-no-intermediates
  • Case 3: run all processes and output intermediate files
    • output: /hot/software/pipeline/pipeline-targeted-coverage/Nextflow/development/unreleased/nzeltser-organize-for-metapipeline/all-outputs-with-intermediates

Checklist

  • I have read the code review guidelines and the code review best practice on GitHub check-list.

  • I have reviewed the Nextflow pipeline standards.

  • The name of the branch is meaningful and well formatted following the standards, using [AD_username (or 5 letters of AD if AD is too long)]-[brief_description_of_branch].

  • I have set up or verified the branch protection rule following the github standards before opening this pull request.

  • I have added my name to the contributors listings in the manifest block in the nextflow.config as part of this pull request, am listed
    already, or do not wish to be listed. (This acknowledgement is optional.)

  • I have added the changes included in this pull request to the CHANGELOG.md under the next release version or unreleased, and updated the date.

  • I have updated the version number in the metadata.yaml and manifest block of the nextflow.config file following semver, or the version number has already been updated. (Leave it unchecked if you are unsure about new version number and discuss it with the infrastructure team in this PR.)

  • I have tested the pipeline on at least one A-mini sample.

Copy link
Contributor

@yashpatel6 yashpatel6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some comments:

main.nf Outdated Show resolved Hide resolved
main.nf Outdated Show resolved Hide resolved
main.nf Outdated Show resolved Hide resolved
Comment on lines +89 to +93
// Calculate HsMetrics
if ( params.collect_metrics ) {

//Get intervalLists (only needed for collectHsMetrics)
//if you already have the interval list use that, otherwise, generate interval list from BedToIntervalList process
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not necessary in this PR but I'd recommend trying to put the processes under each if statement into their own workflows to simplify main.nf

module/merge_bedfiles_bedtools.nf Outdated Show resolved Hide resolved
module/merge_bedfiles_bedtools.nf Outdated Show resolved Hide resolved
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also not necessarily in this PR, but I'd recommend using PlantUML to diagram for pipelines as it then allows for the diagram to be tracked as code and is easily modifiable compared to just image files that can be hard to modify and track changes for

@yashpatel6
Copy link
Contributor

Urgent things to address before official release:

* Organize tool-specific reference files in appropriate /hot/ref/ location.
  * See [Notes on off-target coverage calculation filtered by dbSNP loci #18](https://github.com/uclahs-cds/pipeline-targeted-coverage/discussions/18) for notes on dbSNP reference file formatting. /hot/user/nzeltser/pipeline-targeted-coverage/test/input/dbSNP-155_thinned_hg38.vcf.gz
  * Additional non-standard ref files: genome sizes required by `bedtools slop` and provided with the `bedtools` distribution: /hot/user/nzeltser/tools/bedtools2/genomes/human.hg38.genome

* Confirm standardized output directory naming. Currently all outputs fall under `SAMtools-1.16.1` as this is the tool behind the main read depth calculation process - do we want to keep this?

Generally, yes we want the outputs under the main tool of the pipeline; since this pipeline really only has one main tool for performing its main function, that should stay as is (as opposed to pipelines like call-sSNV where there are multiple main tools since multiple tools for calling sSNVs are implemented)

* Should the targeted sequencing dataset be converted into an official test set?

It should, through NFTest, become a formal test case; in terms of eventually publicizing the pipeline (either through metapipeline or on its own), consider any potential concerns about whether the targeted sequencing dataset can be public in any way

* Do I need to implement checksum creation? Does not currently exist.

I would recommend yes; we have a modularized process for it so it should be straight-forward to add - https://github.com/uclahs-cds/pipeline-Nextflow-module/tree/main/modules/PipeVal/generate-checksum

@alkaZeltser
Copy link
Collaborator Author

* Do I need to implement checksum creation? Does not currently exist.

I would recommend yes; we have a modularized process for it so it should be straight-forward to add - https://github.com/uclahs-cds/pipeline-Nextflow-module/tree/main/modules/PipeVal/generate-checksum

Can you share an up-to-date implementation example from an existing pipeline?

@yashpatel6
Copy link
Contributor

* Do I need to implement checksum creation? Does not currently exist.

I would recommend yes; we have a modularized process for it so it should be straight-forward to add - https://github.com/uclahs-cds/pipeline-Nextflow-module/tree/main/modules/PipeVal/generate-checksum

Can you share an up-to-date implementation example from an existing pipeline?

Here's an example of it that was used when the module was created and tested: https://github.com/uclahs-cds/pipeline-align-DNA/blob/8b5ba8508df4f3affd9391711190e56f8e2cf1ec/module/align_DNA_BWA_MEM2.nf#L19-L26

Copy link
Contributor

@yashpatel6 yashpatel6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally looks good! There are some things that can be cleaned up but not needed in this PR. Can you move the outputs with the checksums to the dev directory?

Anything else to add @sorelfitzgibbon ?

main.nf Outdated Show resolved Hide resolved
@sorelfitzgibbon
Copy link

Anything else to add @sorelfitzgibbon ?

I think I have a few comments, please give me until tomorrow.

Copy link

@sorelfitzgibbon sorelfitzgibbon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Just a few minor comments.

module/run_HS_metrics.nf Outdated Show resolved Hide resolved
module/depth_to_bed.nf Outdated Show resolved Hide resolved

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is probably part of the "clean up" @yashpatel6 suggested leaving for another PR: Much of the methods.config is out of date and could be replaced with those in pipeline-Nextflow-config as a submodule.

@alkaZeltser
Copy link
Collaborator Author

Generally looks good! There are some things that can be cleaned up but not needed in this PR. Can you move the outputs with the checksums to the dev directory?

Done!
/hot/software/pipeline/pipeline-targeted-coverage/Nextflow/development/unreleased/nzeltser-organize-for-metapipeline/all-outputs-with-checksums

@alkaZeltser alkaZeltser merged commit 1579289 into main Dec 1, 2023
1 check passed
@alkaZeltser alkaZeltser deleted the nzeltser-organize-for-metapipeline branch December 1, 2023 22:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add a default hsMetrics --NEAR_DISTANCE param
3 participants