Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Dorado] New Dorado Basecalling Workflow Terra #659

Open
wants to merge 146 commits into
base: main
Choose a base branch
from

Conversation

fraser-combe
Copy link
Contributor

@fraser-combe fraser-combe commented Oct 24, 2024

This PR closes #713

🗑️ This dev branch should be deleted after merging to main.

🧠 Summary

A new Dorado Basecalling Workflow, a GPU-accelerated pipeline for basecalling Oxford Nanopore POD5 files. The workflow includes optional automatic model selection, SAM-to-BAM conversion, and demultiplexing into unique barcode fastq files, with outputs uploaded to a new user defined Terra table for further downstream analysis.

⚡ Impacted Workflows/Tasks

This is a new workflow that does not impact any other workflows

This PR may lead to different results in pre-existing outputs: No

This PR uses an element that could cause duplicate runs to have different results: No

🛠️ Changes

This PR introduces the following changes:

  • New Workflow: Dorado Basecalling Workflow version 1.0.
  • Optional Inputs: Added use_auto_model flag for automatic model selection.
  • Manual and Auto Model Options: Supports both predefined models and automatic selection (sup, hac, fast).
  • SAM-to-BAM Conversion: Integrated SAMTools task for efficient data handling.
  • Demultiplexing: Added demux step to create barcode-specific FASTQ files.
  • Terra Integration: Outputs transferred to Terra, with a table generated for downstream workflows.

⚙️ Algorithm

  • New Tasks:
    1. Dorado Basecall: Converts POD5 files to SAM using GPU acceleration. Uses a new Dorado Staph-B Docker image v0.80
      https://github.com/StaPH-B/docker-builds/tree/master/dorado/0.8.0
    2. SAMTools Convert: Converts SAM files to BAM.
    3. Dorado Demultiplexing: Creates barcode-specific FASTQ files.
    4. File Transfer: Uploads FASTQ files to Terra.
    5. Terra Table Creation: Generates Terra table from the uploaded FASTQ files.

➡️ Inputs

  • New Inputs:
    • use_auto_model (Boolean): Enables automatic model selection.
    • model_accuracy (String): Specifies model accuracy if using auto-selection (sup, hac, fast).
    • fastq_file_name (String): Prefix for output FASTQ files.
    • fastq_upload_path (String): Path to Terra for uploading FASTQ files.
    • kit_name (String): Specifies sequencing kit for adapter/barcode trimming.

⬅️ Outputs

  • New Outputs:
    • basecalled_fastqs: Array of FASTQ files generated from basecalling.
    • demuxed_fastqs: Array of FASTQ files generated from demultiplexing.
    • logs: Logs generated during the demux step.
    • terra_table_tsv: TSV file for uploading to Terra.

🧪 Testing

Test 2. 24 pod5 files from 2 barcodes (manual model)
https://app.terra.bio/#workspaces/theiagen-training-workspaces/Theiagen_FCombe_sandbox/job_history/9bef28ea-82ba-4406-8545-f32de7e07e02

image

test 3. 24 files from 2 barcodes (auto mode)
https://app.terra.bio/#workspaces/theiagen-training-workspaces/Theiagen_FCombe_sandbox/job_history/cead789e-c737-4541-a6ed-d9b907493ee1

output terra table example
image

  • Edge Case Handling: Verified workflow behavior with missing inputs and unsupported models.
  • Terra Integration: Confirmed successful transfer and Terra table generation with sample data.

Suggested Scenarios for Reviewer to Test

  1. Basecalling with Auto Model Selection: Run with the use_auto_model flag enabled.
  2. Manual Model Input: Test with a specific dorado_model path and confirm outputs.
  3. Demultiplexing: Verify barcode-specific FASTQ outputs.
  4. Edge Case: Provide incomplete inputs (e.g., missing kit_name) to confirm error handling.
  5. Terra Table Generation: Confirm Terra table creation and FASTQ uploads with valid inputs.

🔬 Final Developer Checklist

  • The workflow/task has been tested and results, including file contents, are as anticipated
  • The CI/CD has been adjusted and tests are passing (Theiagen developers)
  • Code changes follow the style guide
  • Documentation and/or workflow diagrams have been updated if applicable (Theiagen developers only)

🎯 Reviewer Checklist

  • All changed results have been confirmed
  • You have tested the PR appropriately (see the testing guide for more information)
  • All code adheres to the style guide
  • MD5 sums have been updated
  • The PR author has addressed all comments
  • The documentation has been updated

@AndrewLangvt
Copy link
Contributor

Good catch on some of the updates you recently made to the no trim option & also custom primers @fraser-combe. Structurally, I think this is nearly ready. @kapsakcj do you have bandwidth to do some additional testing of these new features?

@kapsakcj
Copy link
Contributor

@AndrewLangvt @fraser-combe Let me check with team leads first, but given the green light I should have time later this week or early next week to review Fraser's tests and do some of my own testing

@sage-wright
Copy link
Member

sage-wright commented Jan 17, 2025

Hey Fraser! I haven't had any chance to look at this PR yet, so here are some of my thoughts. Some of these things do not need to be implemented, but could be "nice-to-have" and so I've indicated them as such and removed the check box since they're mainly ideas for the future

  1. Documentation suggestions:
  • fix the typo in dorado_input_seelction.png file name
  • update the "last known changes" for dorado in the workflows_overviews and quick facts section to be vX.X.X
  • in workflows_type.md, cauris_cladetyper's last known changes has been accidentally overwritten to be v1.0.0 when it should be vX.X.X.
  • looks like a similar thing happened in workflows_kingdom.md, since kraken2's entry is accidentally duplicated. I think the version directly beneath the entry for Dorado_Basecalling should be removed
    The rest of these apply to dorado_basecalling.md:
  • NICE TO HAVE; NOT NECESSARY: please clarify the language in the introduction since the workflow doesn't utilize the terra data uploader itself. The user uploads the files outside of the workflow and the workflow scrapes the provided path to find the data - agree makes it clearer.
  • NICE TO HAVE; NOT NECESSARY: consider making "Example Manual Models" a level 4 header instead of a level 3 header so that it is beneath the "Model Type Selection" header
  • Turn the toggle for the supported kit names into a heading, or alternatively, expand the toggle by default. It’s easy to accidentally scroll past this.
  • move the "uploading pod5 files" section to be the very first section after the introduction and remove the " from the end of that header,
    • NICE TO HAVE; NOT NECESSARY: I would capitalize POD5 in the header as well to make sure the capitalization is the same everywhere
  • move the "Detailed Input Information" box and the "File Naming Guidelines" to be directly underneath the Inputs heading since this information is really useful and should be higher up!
    • in the detailed input information, the fastq_file name example ("project001-")should have the - removed since that (the -) is automatically added in the code
    • in the file naming guidelines, please update the examples since based on the described behavior, if you provide the accepted prefix in the example, the output file names would be "projectname-barcode01.fastq.gz-barcodeXX.fastq.gz"
  • In the inputs table, please update the runtime parameters to use the language described in the style guide for the docs
  • In the workflow tasks section, please add the expected !!! techdetails tables to these tasks as well
  1. dorado_basecall suggestions:
  • The output of this task is natively a BAM file, so instead of putting --emit-sam you can remove that parameter and skip the convert_sam_to_bam task! Efficiency!! Make sure to update all of the sam references to bam accordingly - thanks for pointing out missed the recent update with dorado allowing bam now
  • make sure to include all the runtime parameters to match the other tasks (the disk: disksize + "GB" ones or whatever they are) (and maybe do this for the other tasks if applicable)
    FC-Looks ok to me, basecall has different runtime params to run in gpu
  1. dorado_demx.wdl suggestions:
  • implement curtis' suggestion to add the --threads input parameter! this will speed up runtime
  • NICE TO HAVE; NOT NECESSARY: maybe rename the fastq_file_name variable to be output_file_prefix to be more clear about its purpose
  • remove the date command at the very end
  1. list_pod5_files wdl suggestions:
  • this task looks great!! I can see us using this task in a ton of other scenarios, but we would need to generalize this. Could you please:
    • create a file_ending variable to grep for instead of the hard-coded .pod5 ending
    • change all pod5 references to be generic
  1. wf_dorado_basecalling.wdl suggestions:
  • rename notrim to specifically reference the demux task (something like demux_no_trim) since we are trimming in a separate task and this could be confusing
  • fix whitespaces to match style guide
  • remove the paired_end and assembly_data and file_ending input variables and hard-code the create_terra_table task inputs to have paired_end = false, and assembly_data = false, (because those will always be false in this workflow). Also, hard-code the file_ending to be ".fastq.gz" as that is the file ending you are giving to all of the generated fastqs and letting the user have the ability to modify that may result in task failure

I have a final comment that would result in an efficiency improvement, but it's a bigger lift and functionally would behave exactly the same as what you're already doing so I don't want you to implement it. But I thought I would mention it here for the sake of our posterity.

Instead of doing a transfer_files -> create_terra-table, I think we could save a fair bit of runtime by sending the array of fastq files directly into a new task that will make a terra table using that array and then upload that table. It would act like create_terra_table but instead of finding data in a directory, it would take the data from an input array.

This looks great!!

@fraser-combe
Copy link
Contributor Author

Ready for review,
Added a new task dorado_trim.wdl that takes the custom primers fasta file and trims any sequences that are identified after the dorado_demultiplexing step (optional file if not present skips) tested here and confirmed locally
No trim functionality tested and successful locally and in terra for dorado_demux update
no trim set to false
no trim set to true

Sage comments updated throughout - thanks some ideas I had not thought of

Documentation updated to reflect new inputs and new trim task
removed samtools convert task with updated dorado now outputting bams

Updated Dorado to v 0.9.0 and updated all docker images to current version in basecall, demux and trim steps tested here (currently still running)

Copy link
Contributor

@AndrewLangvt AndrewLangvt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few follow ups that are outstanding from Sage's initial comments.

docs/workflows_overview/workflows_alphabetically.md Outdated Show resolved Hide resolved
docs/workflows_overview/workflows_kingdom.md Outdated Show resolved Hide resolved
docs/workflows_overview/workflows_type.md Outdated Show resolved Hide resolved
docs/workflows/standalone/dorado_basecalling.md Outdated Show resolved Hide resolved
docs/workflows/standalone/dorado_basecalling.md Outdated Show resolved Hide resolved
tasks/basecalling/task_dorado_trim.wdl Show resolved Hide resolved
workflows/utilities/data_import/wf_create_terra_table.wdl Outdated Show resolved Hide resolved
tasks/utilities/file_handling/task_list_pod5_files.wdl Outdated Show resolved Hide resolved
@fraser-combe
Copy link
Contributor Author

@AndrewLangvt
Copy link
Contributor

@kapsakcj are you able to take another look at this, following your previous comments & to ensure it will meet the needs of CDPH?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Dorado basecalling workflow enhancements
6 participants