Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow input of recalibration tables #45

Merged
merged 9 commits into from
Dec 21, 2023
Merged

Conversation

yashpatel6
Copy link
Collaborator

Description

Adding the option to allow users to input BQSR recalibration tables if they already have them. These are especially useful when the pipeline fails after the BaseRecalibrator step and needs to be re-run. BaseRecalibrator is the longest running process in the pipeline and a bottleneck so existing tables can save significant compute time.

Also, updating to use the modularized set_env function as the current iteration in the pipeline causes issues when submitted through nftest

Testing Results

  • NFTest

    • log: /hot/software/pipeline/pipeline-recalibrate-BAM/Nextflow/development/unreleased/yashpatel-use-recal-table/log-nftest-20231219T214754Z.log
    • cases: default set
  • Explicit recalibration table tests (Use of recalibration table can be confirmed from the trace file where the BaseRecalibrator process runs very briefly if the table already exists):

    • single sample: /hot/software/pipeline/pipeline-recalibrate-BAM/Nextflow/development/unreleased/yashpatel-use-recal-table/single_with_recal
    • paired samples with one table: /hot/software/pipeline/pipeline-recalibrate-BAM/Nextflow/development/unreleased/yashpatel-use-recal-table/paired_one_recal
    • paired samples with two tables: /hot/software/pipeline/pipeline-recalibrate-BAM/Nextflow/development/unreleased/yashpatel-use-recal-table/paired_two_recal

Checklist

  • I have read the code review guidelines and the code review best practice on GitHub check-list.

  • I have reviewed the Nextflow pipeline standards.

  • The name of the branch is meaningful and well formatted following the standards, using [AD_username (or 5 letters of AD if AD is too long)]-[brief_description_of_branch].

  • I have set up or verified the branch protection rule following the github standards before opening this pull request.

  • I have added my name to the contributors listings in the manifest block in the nextflow.config as part of this pull request, am listed
    already, or do not wish to be listed. (This acknowledgement is optional.)

  • I have added the changes included in this pull request to the CHANGELOG.md under the next release version or unreleased, and updated the date.

  • I have updated the version number in the metadata.yaml and manifest block of the nextflow.config file following semver, or the version number has already been updated. (Leave it unchecked if you are unsure about new version number and discuss it with the infrastructure team in this PR.)

  • I have tested the pipeline using NFTest, or I have justified why I did not need to run NFTest above.

@yashpatel6
Copy link
Collaborator Author

Note: the paired sample tests are still in progress

@yashpatel6
Copy link
Collaborator Author

Update: the paired test runs have completed successfully, with the runtime reflecting the time saved by providing the recalibration tables

Copy link

@tyamaguchi-ucla tyamaguchi-ucla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Have you tested samples with multiple RGs as BQSR runs per RG? If I recall, the BQSR table is generated per sample and can contain multiple RGs within it. Is that correct?

README.md Outdated Show resolved Hide resolved
@yashpatel6
Copy link
Collaborator Author

Looks good. Have you tested samples with multiple RGs as BQSR runs per RG? If I recall, the BQSR table is generated per sample and can contain multiple RGs within it. Is that correct?

Right, the table is generated per sample and would contain all of the read groups within it; I haven't explicitly tested with multiple read groups since the intent here was to use the table that would've been generated and used anyways in the pipeline if there wasn't any failure. Given that the pipeline works with multiple read groups as is, I wouldn't expect any difference as long as the right table is given

Copy link

@tyamaguchi-ucla tyamaguchi-ucla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, the table is generated per sample and would contain all of the read groups within it; I haven't explicitly tested with multiple read groups since the intent here was to use the table that would've been generated and used anyways in the pipeline if there wasn't any failure. Given that the pipeline works with multiple read groups as is, I wouldn't expect any difference as long as the right table is given

Got it. The change looks good to me! Anything else to add? @Faizal-Eeman

@yashpatel6 yashpatel6 merged commit 7f10450 into main Dec 21, 2023
1 check passed
@yashpatel6 yashpatel6 deleted the yashpatel-use-recal-table branch December 21, 2023 18:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants