host_filter.wdl modernization #70

mlin · 2022-05-31T03:24:08Z

copilot:summary

valenzuelaomar · 2022-10-18T22:16:02Z

short-read-mngs/host_filter.wdl

+    File human_bowtie2_index_tar
+    File human_hisat2_index_tar


What if the host genome is not human? Would I still provide the custom host genome file path as the value of human_bowtie2_index_tar?

I'm trying to figure out how the inputs here change depending on human vs non-human hosts

mlin · 2022-10-19

@ovalenzuela19 Good question. If the host genome is non-human, then we filter against both the host and human genomes. This protects privacy in case, for example, we're sequencing a mosquito carrying some blood from a human it recently bit. So the pipeline always takes in the human index files in addition to the host index files (human or not).

valenzuelaomar · 2022-10-19

Ahh this makes sense! Thanks so much!
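
To make the input rule in this thread concrete, here is a minimal Python sketch of how a caller might assemble the index inputs. `human_bowtie2_index_tar` and `human_hisat2_index_tar` are the inputs shown in the diff; the host-side names (`bowtie2_index_tar`, `hisat2_index_tar`), the function name, and every file path are assumptions made for illustration, not the workflow's confirmed interface.

```python
def build_index_inputs(host_indexes, human_indexes):
    """Return index inputs for a host-filter run.

    The human indexes are always included, even for a non-human host,
    so that stray human reads are filtered out as well.
    """
    return {
        # Hypothetical host-side input names (not confirmed by the PR).
        "bowtie2_index_tar": host_indexes["bowtie2"],
        "hisat2_index_tar": host_indexes["hisat2"],
        # Input names taken from the diff under review.
        "human_bowtie2_index_tar": human_indexes["bowtie2"],
        "human_hisat2_index_tar": human_indexes["hisat2"],
    }


# Placeholder paths, for illustration only.
HUMAN = {"bowtie2": "human.bowtie2.tar", "hisat2": "human.hisat2.tar"}
MOSQUITO = {"bowtie2": "mosquito.bowtie2.tar", "hisat2": "mosquito.hisat2.tar"}

mosquito_inputs = build_index_inputs(MOSQUITO, HUMAN)
human_inputs = build_index_inputs(HUMAN, HUMAN)
```

Under these assumptions, a human host simply passes the same index files on both the host side and the human side.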

valenzuelaomar · 2022-10-28T19:00:30Z

short-read-mngs/host_filter.wdl

-    String host_genome
-    String genome_dir = "STAR_genome/part-0/"
+  # Adapter trimming and QC filtering
+  call fastp_qc {


nit: can we rename the steps slightly? I believe our step names are in UpperCamelCase and start with a verb

e.g. fastp_qc => RunFastpQc
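
As an illustration of the requested naming, the following Python sketch converts a snake_case task name into a verb-led name. The helper name `to_step_name` and the default verb `Run` are assumptions based only on the single example above; the repo's actual convention may choose other verbs per step.

```python
def to_step_name(task_name, verb="Run"):
    """Convert a snake_case task name into a verb-led CamelCase step name."""
    parts = [p for p in task_name.split("_") if p]
    # Capitalize only the first letter of each part, leaving the rest as-is.
    return verb + "".join(p[:1].upper() + p[1:] for p in parts)


renamed = to_step_name("fastp_qc")
```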

mlin · 2023-03-31T09:34:51Z

@ovalenzuela19 @morsecodist @rzlim08 @katrinakalantar Please have a look over this PR before we merge it at last. I took the indexing out of this PR since that's subsumed in #182. Note that this replaces the existing host_filter.wdl; I'm not sure whether there might be value in keeping both versions around (on HEAD).

mlin · 2023-04-06T20:30:08Z

@ovalenzuela19 @morsecodist @rzlim08 @katrinakalantar In today's pipeline meeting we discussed that indeed, it seems like a good idea to keep both versions of the WDL alive on the main branch, since we expect to continue supporting the original version for some time.

One way to do this would be to keep the current diff, which replaces host_filter.wdl and rewires the other parts that refer to it, and simply add original_host_filter.wdl alongside. There are other ways of course, but let me put that one forward as a strawperson.

Thoughts? Would this need some change to whatever part of the system actually launches the original version, if so requested?

rzlim08 · 2023-04-06T23:10:47Z

workflows/short-read-mngs/test/test_short_read_mngs.py

@@ -31,7 +31,7 @@ def test_bench3_viral(short_read_mngs_bench3_viral_outputs):
        taxon_counts = json.load(infile)["pipeline_output"]["taxon_counts_attributes"]

    taxa = set(entry["tax_id"] for entry in taxon_counts)
-    assert len(taxa) == 177
+    assert abs(len(taxa) - 184) < 16


Are we expecting to introduce some variability into the host filtering? Or was this just a holdover from the old assert?
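
For reference, the relaxed assertion in the diff accepts any taxon count from 169 to 199 inclusive, so the previous exact value of 177 still passes. A small Python sketch of the same check follows; the helper name `within_tolerance` is hypothetical and not part of the test suite.

```python
def within_tolerance(observed, expected, tolerance):
    """Mirror the diff's check: abs(observed - expected) < tolerance."""
    return abs(observed - expected) < tolerance


# Enumerate which counts the new assertion would accept.
accepted = [n for n in range(160, 210) if within_tolerance(n, 184, 16)]
lowest, highest = accepted[0], accepted[-1]
```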

rzlim08 · 2023-04-06T23:19:56Z

Thanks Mike and sorry for the late review.

One option is to just use host_filter.wdl as a workflow manager that references other WDL files (e.g. old_host_filter.wdl, new_host_filter.wdl). The WDL files already get zipped up with miniwdl, so we could just call the zip files from the webapp.

I'm not sure if it's possible to have WDL expect different inputs given a flag, but that would be nice. Otherwise I guess we'd have to make almost all of the differing inputs optional.

mlin · 2023-04-07T09:57:13Z

@rzlim08 I think it would be closer to the second case unfortunately. Would you still want to see that approach in view of that potential awkwardness? (Thx- I'm just not familiar enough with the pieces of code actually invoking these workflows to understand the constraints here)

rzlim08 · 2023-04-12T17:48:53Z

I think the other option would be to run with 2 WDL files and have the webapp choose between them. This would likely mean one of the WDLs would be set as the "default" for e.g. local testing/benchmarking, but the webapp could have the logic to run one vs. the other. This might be the easiest solution for now.
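
A minimal Python sketch of that selection logic, as it might look on the caller's side, is shown below. The file names follow the ones floated earlier in this thread (host_filter.wdl and original_host_filter.wdl); the variant keys, the function name, and the idea that the webapp passes a variant string are all assumptions for illustration, not the webapp's actual code.

```python
# Hypothetical mapping from a pipeline variant to the WDL file to launch.
WORKFLOW_FILES = {
    "modern": "host_filter.wdl",
    "original": "original_host_filter.wdl",
}

# The default is what local testing and benchmarking would use.
DEFAULT_VARIANT = "modern"


def select_host_filter_wdl(variant=None):
    """Return the WDL file for the requested variant, or the default one."""
    key = variant or DEFAULT_VARIANT
    try:
        return WORKFLOW_FILES[key]
    except KeyError:
        raise ValueError(f"unknown host filter variant: {key!r}") from None


default_wdl = select_host_filter_wdl()
legacy_wdl = select_host_filter_wdl("original")
```

Keeping the choice in one lookup table means nothing inside either WDL has to become optional just to accommodate the other version.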

mlin · 2023-04-14T17:12:29Z

@ovalenzuela19 The tentative solution here is to merge with the modern host filtering WDL replacing the original one as host_filter.wdl, and adding the original one alongside as original_host_filter.wdl. Everything else in the repo (local_driver.wdl, amr/run.wdl, unit tests, etc.) would be wired to host_filter.wdl and not original_host_filter.wdl.

Question for you: is that going to work for the webapp, as far as how it invokes the original pipeline when asked to do so? Or does it need the zip to have host_filter.wdl as the original pipeline? In that case there would not be much point in adding original_host_filter.wdl at all (and it might cause additional confusion).

Thanks, we're obviously trying to avoid breaking anything that depends on the structure of this repo in ways that aren't necessarily obvious when making these kinds of changes.

valenzuelaomar · 2023-04-14T17:47:38Z

@mlin so would original_host_filter.wdl just be there for reference and not really usable?

I think from a webapp perspective that solution should be fine

mlin · 2023-04-14T19:48:39Z

@ovalenzuela19

Mostly yes, for reference. The unanswered bit is what would we do if a user absolutely needs us to make some further change/bugfix to the original pipeline they're still depending on. How would we roll that out as a tagged release & zip file the webapp can use, once it's been renamed to original_host_filter.wdl?

If we're optimists we would say that's fairly unlikely to happen, to the extent we can punt now and figure it out later if it does, but, thought we should at least air it out here. cc @rzlim08 @morsecodist @katrinakalantar for visibility

rzlim08 · 2023-04-14T21:22:12Z

Yeah I think the main question is how to support both pipelines at the same time indefinitely. If we have both WDLs we'd at least be able to patch either. I'll see if I can make a small customization to pin a release to a major/minor version. That being said I don't think we should be running with this forever and would like to drop support for the old version eventually if we're comfortable enough with the new one.
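
One way to picture pinning a release to a major/minor version is the Python sketch below, which picks the newest patch release within a given major.minor line. The `vMAJOR.MINOR.PATCH` tag format, the sample tags, and the function name are assumptions for illustration; the repo's real tagging scheme is not shown in this thread.

```python
import re

# Assumed tag format: vMAJOR.MINOR.PATCH (hypothetical).
TAG_PATTERN = re.compile(r"^v(\d+)\.(\d+)\.(\d+)$")


def latest_patch(tags, major, minor):
    """Return the newest tag within the given major.minor line, or None."""
    best = None
    for tag in tags:
        match = TAG_PATTERN.match(tag)
        if not match:
            continue  # skip tags that do not follow the assumed format
        version = tuple(int(group) for group in match.groups())
        if version[:2] != (major, minor):
            continue
        # Compare numerically so that patch 12 sorts above patch 3.
        if best is None or version > best[0]:
            best = (version, tag)
    return best[1] if best else None


# Made-up tags, for illustration only.
tags = ["v7.1.0", "v7.1.12", "v7.1.3", "v8.0.0", "v8.0.1", "not-a-version"]
pinned_old = latest_patch(tags, 7, 1)
pinned_new = latest_patch(tags, 8, 0)
```

With a pin like this, a bugfix to the older pipeline could ship as a new patch tag on its own line without disturbing callers pinned to the newer one.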

valenzuelaomar · 2023-04-14T23:10:19Z

What if we just create a new workflow that is just the old short-read-mngs pipeline with the old host filtering wdl? That way we can still maintain it & package it like normal and it's up to the web app which wdl will be used? By default it'll use the existing short-read-mngs workflow which uses the modern host filtering stage.

That way if we need to run the old v7 short-read-mngs pipeline, the web app can just have some conditional in place to fetch the docker image for the short-read-mngs old version. LMK if this approach sounds absurd

kislyuk · 2023-05-06T16:09:13Z

Whoa. Just noticed this! Huge step forward for czid. Congrats on landing this!

valenzuelaomar reviewed Oct 18, 2022

valenzuelaomar reviewed Oct 28, 2022

mlin force-pushed the mlin/modernize-host-filter branch from 489c116 to acdff09 on November 16, 2022 09:33

mlin added 27 commits January 27, 2023 09:56

fastp

e92a5e0

fastp single

24f6841

bowtie2 run

06517ea

hisat2 run

6c485e4

dedup run

623a906

run subsample

0c75c04

run kallisto

c45b5c3

adjust index tar filenames

18917cc

polishing

029ab72

polishing

15d65fe

count reads in each step

bc8aebe

Create host_filter_indexing.wdl

db223c8

boost fastp complexity threshold

d0a5ebe

output fastp report

b60afa1

build fastp from our fork with SDUST complexity filtering

8e308bd

use fastp --sdust_complexity_filter

17e31c5

bump

7bcb00f

bump

b70ce8f

tune

052120e

stub the remaining step descriptions

169336f

wire to tests

f2e27ca

and auto_benchmark

d7b3959

fixup tests

22cc6c2

fixup tests

0089204

fixup tests

a17a508

fixup tests

f96344e

fixup tests

97fbb83

mlin marked this pull request as ready for review March 31, 2023 08:44

delete host_filter_indexing since it's subsumed in #182

e35cd04

mlin changed the title from "host_filter.wdl modernization WIP" to "host_filter.wdl modernization" Mar 31, 2023

mlin requested review from valenzuelaomar, katrinakalantar, morsecodist and rzlim08 March 31, 2023 09:17

rzlim08 reviewed Apr 6, 2023

valenzuelaomar and others added 9 commits April 14, 2023 16:18

Merge branch 'main' into mlin/modernize-host-filter

8e981c4

fix glob patterns in read counting

aeb234f

Revert "fix glob patterns in read counting"

74304e0

This reverts commit aeb234f.

[Bug] fix count expansion for single file short-read-mngs (#216)

ec91bf3

* fix bowtie2 counts for single file
* fix extra expansions
* relieve hisat2 dependency
* single sample hisat2
* fix hisat2
* fix dockerfile for hisat2

Co-authored-by: Omar Valenzuela <[email protected]>

Merge branch 'main' into mlin/modernize-host-filter

021d183

Remove AMR changes that are a WIP from modern host filtering branch (#219)

227a489

* Revert "output gene id in primary output file (#209)" This reverts commit 2d9ff56.
* Revert "Output non host reads and non host contigs for AMR (#205)" This reverts commit 9de3fc2.

tune hisat2 memory usage (#223)

3ad5ad0

Legacy Host Filter initial commit (#224)

3976349

* legacy-host-filter-inital-commit
* linting
* add stage io map
* remove stage io map swp file

Revert "Remove AMR changes that are a WIP from modern host filtering branch (#219)" (#226)

ba46f53

This reverts commit 227a489.

rzlim08 merged commit 60a7e78 into main Apr 25, 2023

rzlim08 deleted the mlin/modernize-host-filter branch April 25, 2023 19:16
