[DRAFT] refactor #428

daler · 2025-01-03T20:53:11Z

This is an experimentation branch.

Goal: simplify reference handling

Current system

The current system has reference configs with fairly complex configuration. They can be imported recursively, and the main functionality they afford is the ability to configure fastq-screen runs extremely easily, plus what is effectively a built-in library of references for a wide range of model organisms.

However, there are layers of abstraction that make it hard to troubleshoot or even understand:

- the mismatch between `references_dict` (for easier computer use) and the actual config file (for easier human use)
- the super awkward way of accessing references (`c.refdict[c.organism][config['gtf']['tag']]['annotation']`)
- the way configs are handled across snakefiles (sometimes file, sometimes dict)
- the need to manually specify indexes and conversions
- the initial goal of being able to run the references workflow as a standalone, building everything at once in a central location, isn't as useful as I thought it would be
- a central references location worked great for a while. But over time, tricky cases came up, such as the same genome version indexed by different versions of an aligner, which implies tracking the organism, the tag, and the version of the aligner that made the index. This is a hard problem to solve and will need more complexity.

New system

- Every experiment will build its own references rather than use a central location. This takes more time and a little more space, but it ensures that each project can remain independent of all others.
- The postprocessing will remain, but the suggested postprocessing will be maintained in documentation rather than in ever-growing reference config yamls.
- `include/reference_configs` will be removed and migrated to docs
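
To make the difference in lookup style concrete, here is a minimal Python sketch. Only the nested access expression comes from the description above; the dictionary contents, the `dmel`/`test` names, the paths, and the layout of the "new" config are hypothetical stand-ins, not the actual structures in this branch.

```python
from types import SimpleNamespace

# Hypothetical stand-in for the old config object: a nested references
# dict keyed by organism -> tag -> reference type.
c = SimpleNamespace(
    organism="dmel",
    refdict={
        "dmel": {
            "test": {"annotation": "references/dmel/test/annotation/dmel_test.gtf"},
        },
    },
)
config = {"gtf": {"tag": "test"}}

# Old style: several levels of indirection to reach a single path.
old_gtf = c.refdict[c.organism][config["gtf"]["tag"]]["annotation"]

# New style (assumed layout): each experiment builds its own references,
# so a plain config can point straight at the file it needs.
new_config = {"gtf": "references/annotation.gtf"}
new_gtf = new_config["gtf"]

print(old_gtf)
print(new_gtf)
```

The point is only that the second lookup can be read without knowing how `refdict` was assembled.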

Goal: more modular workflows

Recent snakemake versions support using modules; refactor as much as possible across rnaseq and chipseq into consolidated modules (that can then have their inputs/outputs modified).

Current system

ChIP-seq and RNA-seq workflows have some common parts (references) factored out, but still have others (fastqc) that are repeated.

New system

~~Add as much as possible to rules/*.smk files, and import those as modules. Still playing around with how much exactly to put in the separate rules files.~~

Notes upon trying this out:

I tried moving all rules into rules/*/*.smk files as modules to be imported with possibly-overridden inputs/outputs. This does not work cleanly because the global is_paired and patterns are unavailable inside a module, unless we re-import them into every module, which would get messy. And I think their utility in the global namespace is too great to get rid of. So I made the rules in the module more generic, overriding their inputs and outputs in the main snakefile that imported them. But this also means overriding the params (which typically use is_paired) and resources (utils.autobump), and by the time you do that, you've largely re-implemented everything except the shell: or run: block. So the end result is that you've just made things more confusing by splitting the rule across two files.

So then I tried moving individual rules into separate .smk files to be used with include:, much like this apparently canonical example. But then it wasn't clear how to split up the various rules. I tried these two schemes:

rules/qc/fastqc.smk
rules/qc/rseqc.smk
rules/qc/preseq.smk
rules/aligners/star.smk
...etc

Or

rules/qc.smk  # has fastqc, rseqc, preseq, etc.
rules/aligners.smk  # has hisat2, star, etc.
...etc

Neither of these felt like it was a net benefit. Storing everything in separate files and include:ing them cleans up the main snakefile tremendously of course. But by fragmenting the rules into files like that, you lose all of the visual consistency of seeing patterns[] and autobump and other commonalities across the rules, which is otherwise a signal that those things are useful to understand and use. It's just much harder to get a cohesive sense of the workflow, unless you open up all of those separate .smk files.

The refactor-separate-smk branch has a (very messy and incomplete) prototype of that, just to get a feel for it. After trying it out, I feel like there is still a benefit of having everything together in the same file as in the original design.

And it turns out that modules are not as useful as I was originally thinking.

Goal: reduce complexity where possible

Current system

The config objects hide a lot of behavior that is unclear. Better to use standard snakemake conventions where possible.

New system

I got this working for the RNA-seq workflow:

- Load configfile directly -- removing reference complexity means we can just use a plain ol' config yaml.
- Store global is_paired, is_sra, and sampletable
- Use patterns because it's very convenient (and I have plans for using that patterns yaml for reporting later). It's a pretty clear mapping of "thing written in yaml file equals thing in snakemake". But don't use targets, because it's much less clear where they come from. Instead, use expand() on patterns. This is a more canonical snakemake idiom, so it's easier to follow. The allow_missing=True kwarg to expand() helps a lot with this; in the past this needed render_r1_r2.
- Removed bigwig merging, which we never ended up using that much and which introduced additional complexity
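
As a rough illustration of the expand()-on-patterns idiom, here is a plain-Python sketch that does not import Snakemake. `expand_allow_missing` is a hypothetical stand-in that mimics how `expand(..., allow_missing=True)` leaves unspecified wildcards in place; it is not Snakemake's implementation, and the `patterns` dict with its key and path is made up for the example.

```python
from itertools import product


class _KeepMissing(dict):
    """Leave placeholders that were not supplied untouched."""

    def __missing__(self, key):
        return "{" + key + "}"


def expand_allow_missing(pattern, **wildcards):
    """Fill in the given wildcards (all combinations), keep the rest as-is."""
    names = list(wildcards)
    value_lists = [
        v if isinstance(v, (list, tuple)) else [v] for v in wildcards.values()
    ]
    return [
        pattern.format_map(_KeepMissing(zip(names, combo)))
        for combo in product(*value_lists)
    ]


# Hypothetical patterns entry, as it might be loaded from a patterns yaml.
patterns = {"fastq": "data/samples/{sample}/{sample}_R{n}.fastq.gz"}

# Fill in the read number but leave {sample} for the rule to resolve.
fastqs = expand_allow_missing(patterns["fastq"], n=[1, 2])
print(fastqs)
```

In a Snakefile, the corresponding call would be `expand(patterns["fastq"], n=[1, 2], allow_missing=True)`, which keeps `{sample}` as a wildcard so the result can be used directly as a rule's input or output.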

Not sure yet how the lack of config object will work with the chipseq workflow, which has to do a lot of complex stuff to work out peak-calling and matching IP to input. Will need to play with that next.

Goal: reduce file count

Current system

Fastqc, fastq_screen, and average_bigwigs each use a wrapper. These create up to 3 additional conda envs, which contribute 100k+ files toward the total filesystem file count for each workflow.

New system

Don't use wrappers if possible; fastq_screen is getting removed due to the simplification of references; average_bigwigs is getting removed since it wasn't as helpful as I originally thought. Fastqc will be added directly as a rule.

Goal: use params as much as possible

Current system

params: are used infrequently and inconsistently. It's not clear when they should be used.

New system

All params that are not strictly required for the functionality of the rule have been moved to the params: directive, most commonly as something like params: extra="--flags here". This will ensure that snakemake picks up params changes and reruns rules appropriately.
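
One way to keep such values in params: is to compute them with a small helper function that the rule calls from its params: directive. The sketch below is an assumption-laden example, not the code in this branch: the `stranded` config key, its three allowed values, and the helper itself are hypothetical. The `-s 0/1/2` values are featureCounts' strandedness options.

```python
# Hypothetical mapping from a config value to featureCounts' -s option
# (0 = unstranded, 1 = stranded, 2 = reversely stranded).
STRAND_ARGS = {
    "unstranded": "-s 0",
    "fr-secondstrand": "-s 1",
    "fr-firststrand": "-s 2",
}


def strand_arg(config):
    """Return the strandedness flag for the (assumed) config['stranded'] value."""
    stranded = config.get("stranded")
    if stranded not in STRAND_ARGS:
        raise ValueError(
            f"config['stranded'] must be one of {sorted(STRAND_ARGS)}, "
            f"got {stranded!r}"
        )
    return STRAND_ARGS[stranded]


print(strand_arg({"stranded": "fr-firststrand"}))
```

A rule could then use `params: strand_arg=strand_arg(config)` and refer to `{params.strand_arg}` in its shell: block, so that the value is part of the rule's params rather than being computed inside a run: block.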

- Removed the unused r1-only=False parameter in the render_r1_r2() function in both the rnaseq and chipseq Snakefiles
- Changed the name of 'r1_only' function to 'render_r1_only' in both Snakefiles to make the name more intuitive and updated the rest of the files accordingly

* Change SRA fastq directory

  Change the directory where SRA fastq files are downloaded and add the 'orig_filename' column to the config object for each sample so that the rest of the workflow works correctly

* Make code more elegant

  Change a nested for-loop implementation in patterns_targets.py to a more elegant one-line solution and clean up some code in Snakefile

* improve helper.fill_patterns

  add check when combining by `zip` to ensure values are all same length

  add more doctests

Co-authored-by: Ryan Dale <[email protected]>

Brandon Fuller and others added 30 commits October 10, 2024 15:06

Merge branch 'master' into fix_406

dfa2353

add newline back in

d0a0300

Merge pull request #413 from lcdb/fix_406

f7e98ef

Fixed render_r1_r2 function(s) in Snakefiles

Make strand_arg a param

7d92555

Move `strand_arg` assignment from the `run` block to the `params` block so that `--rerun-trigger` will detect changes to strandedness configuration and re-run those rules

Merge pull request #415 from lcdb/fix_issue_405

d168bb0

Make strand_arg a param

add Plodia interpunctella reference config (#417)

b808d10

Add lib.postprocess.utils.extract_from_zip function, used for extracting -- and then immediately gzipping -- a file from within a downloaded zip. Include reference config for Plodia interpunctella

Update plotting.R (#423)

47b379a

Just a small mistake Co-authored-by: Ryan Dale <[email protected]>

mambaforge -> miniforge

a2e5448

latest ubuntu for testing

4487d90

https for downloading chainfile

836fff0

noninteractive apt install

09dedd7

noninteractive apt install

b6c663a

debug url

2fc5d71

for test "external" data, do not do liftover

a470398

ucsc might be blocking circle-ci given the licensing requirements of chainfiles

remove support for GAT

ed9161d

this tool was last updated in 2017, and has incompatibilities with recent numpy.

GAT no longer used, remove from requirements

becdf21

don't pin python

dfaec3e

pin snakemake >8

b65f4cd

update env.yml

0f81f07

update snakefiles and lib to reflect changes in snakemake 8

f039b64

rm --bias for kallisto, which was causing segfaults

bec163d

update test args -r --> --reason for snakemake 8

cc310fb

rm --reason for snakemake 8

54514e9

disable colocalization workflow

06c147b

delete lots of stuff

bea0910

add new references.smk

060c2f8

simplify config

8bb7398

utils, common, and helpers are all now in utils

79081fd

daler added 30 commits January 10, 2025 23:09

add bed_to_bigbed as script

595eddf

add peakcallers to requirements.txt

95cefea

clean up log handling for epic2

c357763

test settings overhaul

52ac28a

comment sampletable

9024aa6

various rnaseq fixes

2483b98

chipseq overhaul and simplification

227646c

clean up some tests

4136207

convert rrna table to script

d7bb492

fix test on preprocessor

66f5a11

updated env yaml

bfdbf5e

fix import

a466da0

fix strand check

eb68925

split featurecounts

8f33026

all sorts of fixes and cleanup

39209ce

sra for chipseq

155307a

clean out test suite

fd1c1c3

add strandcheck back to snakefile

d322e33

don't use patterns any more

8b6b52a

snakefmt cleanup

d5799fa

rrna_libsizes_table script avoids utils

da2fc32

use mem and disk rather than mem_mb and disk_mb

b049ef6

convert to mem and disk in references

650e60f

spell out params fully in wrapper

d5db4a5

timestamped log file for slurm wrapper

b3a7d94

rm wrappers

aa437be

resources to strings

9f00366

rm chipseq patterns

65d2e3b

update chipseq_trackhub.py

3b57a27

update rnaseq_trackhub.py

4e86e16
