[DRAFT] refactor #428

Draft
wants to merge 79 commits into
base: master

daler
Contributor

@daler daler commented Jan 3, 2025

This is an experimentation branch.

Goal: simplify reference handling

Current system

The current system uses reference configs with fairly complex configuration. They can be imported recursively, and their main benefits are that they make fastq-screen runs extremely easy to configure and that they effectively provide a built-in library of references for a wide range of model organisms.

However, there are layers of abstraction that make it hard to troubleshoot or even understand:

  • the mismatch between references_dict (for easier computer use) and the actual config file (for easier human use)
  • the super awkward way of accessing references (c.refdict[c.organism][config['gtf']['tag']]['annotation'])
  • the way configs are handled across snakefiles (sometimes file, sometimes dict)
  • the need to manually specify indexes and conversions
  • the initial goal of being able to run the references workflow as a standalone to build everything at once in a central location isn't as useful as I thought it would be
  • a central references location worked great for a while. But over time, there are tricky things with same genome version but different versions of aligner, which implies being able to track organism, tag, and version of aligner that made the index. This is a hard problem to solve and will need more complexity.
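
To make the access pattern above concrete, here is a minimal sketch in plain Python. The organism, tag, and file paths are made-up placeholders, `refdict` is shown as an ordinary nested dict rather than the real object built from the reference configs, and `flat_config` is only one hypothetical shape a simpler config could take.

```python
# Hypothetical stand-in for the nested references dict
# (organism -> tag -> type). All names and paths are placeholders.
refdict = {
    "dmel": {
        "test": {
            "annotation": "references/dmel/test/annotation/dmel_test.gtf",
            "genome": "references/dmel/test/genome/dmel_test.fasta",
        }
    }
}

# Hypothetical user-facing config that selects which entry to use.
config = {"organism": "dmel", "gtf": {"tag": "test"}}

# Current style: three levels of indirection to reach a single path.
gtf = refdict[config["organism"]][config["gtf"]["tag"]]["annotation"]

# A flatter alternative (an assumption, not the final design): the config
# names the file directly, so a rule can use the value as-is.
flat_config = {"annotation": "references/annotation.gtf"}
flat_gtf = flat_config["annotation"]

print(gtf)
print(flat_gtf)
```

The point of the comparison is only that the second lookup can be read without knowing how `refdict` was assembled.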

New system

  • Every experiment will build its own references rather than use a central location. This takes more time and a little more space, but it ensures that each project can remain independent of all others.
  • The postprocessing will remain, but instead of using reference configs, the suggested postprocessing will be maintained in documentation rather than ever-growing reference config yamls.
  • include/reference_configs will be removed, and migrated to docs

Goal: more modular workflows

Recent snakemake versions support using modules; refactor as much as possible across rnaseq and chipseq into consolidated modules (that can then have their inputs/outputs modified).

Current system

ChIP-seq and RNA-seq workflows have some common parts (references) factored out, but still have others (fastqc) that are repeated.

New system

Add as much as possible to rules/*.smk files, and import those as modules. Still playing around with how much exactly to put in the separate rules files.

Notes upon trying this out:

I tried moving all rules into rules/*/*.smk files as modules to be imported with possibly-overridden inputs/outputs. This does not work cleanly because the global is_paired and patterns are unavailable, unless we re-import them into every module, which would get messy. And I think their utility in the global namespace is too great to get rid of. So I made the rules in the module more generic, overriding their inputs and outputs in the main snakefile that imported them. But this also means overriding the params (which typically use is_paired) and resources (utils.autobump), and by the time you do that, you've largely re-implemented everything except the shell: or run: block. So the end result is that you just made things more confusing by splitting the rule across two files.

So then I tried moving individual rules into separate .smk files to be used with include:, much like this apparently canonical example. But then it wasn't clear how to split up the various rules. I tried these two schemes:

rules/qc/fastqc.smk
rules/qc/rseqc.smk
rules/qc/preseq.smk
rules/aligners/star.smk
...etc

Or

rules/qc.smk  # has fastqc, rseqc, preseq, etc.
rules/aligners.smk  # has hisat2, star, etc.
...etc

Neither of these felt like it was a net benefit. Storing everything in separate files and include:ing them cleans up the main snakefile tremendously of course. But by fragmenting the rules into files like that, you lose all of the visual consistency of seeing patterns[] and autobump and other commonalities across the rules, which is otherwise a signal that those things are useful to understand and use. It's just much harder to get a cohesive sense of the workflow, unless you open up all of those separate .smk files.

The refactor-separate-smk branch has a (very messy and incomplete) prototype of that, just to get a feel for it. After trying it out, I feel like there is still a benefit of having everything together in the same file as in the original design.

And it turns out that modules are not as useful as I was originally thinking.

Goal: reduce complexity where possible

Current system

The config objects hide a lot of behavior that is unclear. Better to use standard snakemake conventions where possible.

New system

I got this working for the RNA-seq workflow:

  • Load configfile directly -- removing reference complexity means we can just use a plain ol' config yaml.
  • Store global is_paired, is_sra, and sampletable
  • Use patterns because it's very convenient (and I have plans for using that patterns yaml for reporting later). It's a pretty clear mapping of "thing written in yaml file equals thing in snakemake". But don't use targets, because it's much less clear where they come from. Instead, use expand() on patterns. This is a more canonical snakemake idiom, so it's easier to follow. The allow_missing=True kwarg to expand() helps a lot here; in the past this required render_r1_r2.
  • removed bigwig merging, which we never ended up using that much and which introduced additional complexity
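
As a rough illustration of the expand-on-patterns idea in the list above, the sketch below imitates in plain Python what `expand(..., allow_missing=True)` does: fill in some wildcards and leave the rest in place. `partial_expand`, `KeepMissing`, and the example pattern are hypothetical names made up for this sketch so that it runs without Snakemake; it is not Snakemake's own implementation.

```python
from itertools import product


class KeepMissing(dict):
    """Mapping that leaves unknown placeholders untouched, e.g. '{sample}'."""

    def __missing__(self, key):
        return "{" + key + "}"


def partial_expand(pattern, **wildcards):
    """Fill only the given wildcards; every other {name} is kept as-is."""
    names = list(wildcards)
    # Accept either a single value or a list of values per wildcard.
    value_lists = [
        v if isinstance(v, (list, tuple)) else [v] for v in wildcards.values()
    ]
    return [
        pattern.format_map(KeepMissing(zip(names, combo)))
        for combo in product(*value_lists)
    ]


# Hypothetical pattern, as it might appear in a patterns yaml.
patterns = {"fastq": "data/{sample}/{sample}_R{n}.fastq.gz"}

# Global flag as described above; paired-end data has both R1 and R2.
is_paired = True
reads = [1, 2] if is_paired else [1]

# {n} is filled in, {sample} is left for Snakemake to resolve per job.
r1_r2 = partial_expand(patterns["fastq"], n=reads)
print(r1_r2)
```

In a Snakefile the corresponding call would be `expand(patterns["fastq"], n=reads, allow_missing=True)`, which covers the case that previously needed a helper like render_r1_r2.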

Not sure yet how the lack of config object will work with the chipseq workflow, which has to do a lot of complex stuff to work out peak-calling and matching IP to input. Will need to play with that next.

Goal: reduce file count

Current system

Fastqc, fastq_screen, and average_bigwigs use a wrapper, which creates up to another 3 conda envs, which contribute 100k+ files toward total filesystem file count for each workflow.

New system

Don't use wrappers if possible; fastq_screen is getting removed due to the simplification of references; average_bigwigs is getting removed since it wasn't as helpful as I originally thought. Fastqc will be added directly as a rule.

Goal: use params as much as possible

Current system

params: are used infrequently and inconsistently. It's not clear when they should be used.

New system

All params that are not strictly required for the functionality of the rule have been moved to the params: directive, most commonly as something like params: extra="--flags here". This will ensure that snakemake picks up params changes and reruns rules appropriately.
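
A toy model of why this matters, in plain Python: if the values that shape a command live in one declared `params` mapping, a change to any of them can be detected by comparing against what was recorded for the previous run. `params_fingerprint` and the flag strings are hypothetical; this is not how Snakemake stores or compares its metadata, only a sketch of the idea.

```python
import hashlib
import json


def params_fingerprint(params):
    """Stable digest of a params dict (toy stand-in for recorded metadata)."""
    encoded = json.dumps(params, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


# Params as they were when the output was last produced (placeholder flags).
recorded = params_fingerprint({"extra": "--flags here"})

# Same params again: nothing to redo.
unchanged = params_fingerprint({"extra": "--flags here"}) == recorded

# Edited params: the difference is visible, so the rule can be rerun.
params_changed = params_fingerprint({"extra": "--other flags"}) != recorded

print(unchanged, params_changed)
```

A value that is computed inside a run: block never appears in such a mapping, which is why the strand_arg assignment was moved into params in one of the commits listed below.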

Brandon Fuller and others added 30 commits October 10, 2024 15:06
- Removed the unused r1-only=False parameter in the render_r1_r2() function in both the rnaseq and chipseq Snakefiles
- Changed the name of 'r1_only' function to 'render_r1_only' in both Snakefiles to make the name more intuitive and updated the rest of the files accordingly
Fixed render_r1_r2 function(s) in Snakefiles
Move `strand_arg` assignment from the `run` block to the `params` block so that `--rerun-trigger` will detect changes to strandedness configuration and re-run those rules
Add lib.postprocess.utils.extract_from_zip function, used for extracting -- and then immediately gzipping -- a file from within a downloaded zip.

Include reference config for Plodia interpunctella
Just a small mistake

Co-authored-by: Ryan Dale <[email protected]>
* Change SRA fastq directory

Change the directory where SRA fastq files are downloaded and add the
'orig_filename' column to the config object for each sample so that the rest of the workflow works correctly

* Make code more elegant

Change a nested for-loop implementation in patterns_targets.py to a more elegant one-line solution and clean up some code in Snakefile

* improve helper.fill_patterns

add check when combining by `zip` to ensure values are all same length

add more doctests

---------

Co-authored-by: Ryan Dale <[email protected]>
ucsc might be blocking circle-ci given the licensing requirements of
chainfiles
this tool was last updated in 2017, and has incompatibilities with recent
numpy.
3 participants