[DRAFT] refactor #428
Draft: daler wants to merge 79 commits into master from refactor
- Removed the unused r1-only=False parameter in the render_r1_r2() function in both the rnaseq and chipseq Snakefiles
- Changed the name of the 'r1_only' function to 'render_r1_only' in both Snakefiles to make the name more intuitive, and updated the rest of the files accordingly
Fixed render_r1_r2 function(s) in Snakefiles
Move `strand_arg` assignment from the `run` block to the `params` block so that `--rerun-trigger` will detect changes to strandedness configuration and re-run those rules
Make strand_arg a param
Add lib.postprocess.utils.extract_from_zip function, used for extracting -- and then immediately gzipping -- a file from within a downloaded zip. Include reference config for Plodia interpunctella
Just a small mistake
Co-authored-by: Ryan Dale
* Change SRA fastq directory: change the directory where SRA fastq files are downloaded and add the 'orig_filename' column to the config object for each sample so that the rest of the workflow works correctly
* Make code more elegant: change a nested for-loop implementation in patterns_targets.py to a more elegant one-line solution and clean up some code in Snakefile
* improve helper.fill_patterns: add check when combining by `zip` to ensure values are all same length; add more doctests
Co-authored-by: Ryan Dale
ucsc might be blocking circle-ci given the licensing requirements of chainfiles
this tool was last updated in 2017, and has incompatibilities with recent numpy.
This is an experimentation branch.
Goal: simplify reference handling
Current system
The current system uses reference configs with fairly complex configuration. They can be imported recursively, and the main functionality they afford is the ability to configure fastq-screen runs extremely easily, plus an effectively built-in library of references for a wide range of model organisms.
However, there are layers of abstraction that make it hard to troubleshoot or even understand:
`c.refdict[c.organism][config['gtf']['tag']]['annotation']`
New system
`include/reference_configs` will be removed, and migrated to docs.
Goal: more modular workflows
Recent snakemake versions support modules; refactor as much as possible across rnaseq and chipseq into consolidated modules (that can then have their inputs/outputs modified).
Current system
ChIP-seq and RNA-seq workflows have some common parts (references) factored out, but still have others (fastqc) that are repeated.
New system
Add as much as possible to `rules/*.smk` files, and import those as modules. Still playing around with how much exactly to put in the separate rules files.
Notes upon trying this out:
I tried moving all rules into `rules/*/*.smk` files as modules to be imported with possibly-overridden inputs/outputs. This does not work cleanly because the global `is_paired` and `patterns` are unavailable, unless we re-import them into every module, which would get messy. And I think their utility in the global namespace is too great to get rid of. So I made the rules in the module more generic, overriding their inputs and outputs in the main snakefile that imported them. But this also means overriding the params (typically uses `is_paired`) and resources (`utils.autobump`), and by the time you do that, you've largely re-implemented everything except the `shell:` or `run:` block. So the end result is that you just made things more confusing by splitting the rule across two files.
So then I tried moving individual rules into separate `.smk` files to be used with `include:`, much like this apparently canonical example. But then it wasn't clear how to split up the various rules. I tried these two schemes:
Or
Neither of these felt like it was a net benefit. Storing everything in separate files and `include:`ing them cleans up the main snakefile tremendously of course. But by fragmenting the rules into files like that, you lose all of the visual consistency of seeing `patterns[]` and `autobump` and other commonalities across the rules, which is otherwise a signal that those things are useful to understand and use. It's just much harder to get a cohesive sense of the workflow, unless you open up all of those separate `.smk` files.
The `refactor-separate-smk` branch has a (very messy and incomplete) prototype of that, just to get a feel for it. After trying it out, I feel like there is still a benefit of having everything together in the same file as in the original design.
And it turns out that modules are not as useful as I was originally thinking.
Goal: reduce complexity where possible
Current system
The config objects hide a lot of behavior that is unclear. Better to use standard snakemake conventions where possible.
New system
I got this working for the RNA-seq workflow:
`is_paired`, `is_sra`, and `sampletable`
Keep `patterns` because it's very convenient (and I have plans for using that patterns yaml for reporting later). It's a pretty clear mapping of "thing written in yaml file equals thing in snakemake". But don't use `targets` because it's much less clear where they come from. Instead, use `expand()` on patterns. This is a more canonical snakemake idiom, so easier to follow. And the `expand(..., allow_missing=True)` kwarg helps a lot with this, which in the past needed to use `render_r1_r2`.
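To make the `expand()`-on-patterns idea concrete, here is a rough pure-Python sketch of partial expansion: fill in some wildcards and leave the rest as `{name}` placeholders for later. It is only a stand-in for what `expand(..., allow_missing=True)` does, not snakemake's implementation. `partial_expand` and the example pattern path are hypothetical names made up for illustration.

```python
import itertools
import string


def partial_expand(pattern, **wildcards):
    """Fill the given wildcards in `pattern`; leave all others as {name}.

    Hypothetical helper for illustration only (not part of snakemake).
    """
    # Normalize scalars to one-item lists so all values are handled alike.
    values = {
        key: val if isinstance(val, (list, tuple)) else [val]
        for key, val in wildcards.items()
    }
    keys = list(values)
    results = []
    # One output string per combination of the supplied wildcard values.
    for combo in itertools.product(*(values[k] for k in keys)):
        filled = dict(zip(keys, combo))
        parts = []
        for literal, field, _spec, _conv in string.Formatter().parse(pattern):
            parts.append(literal)
            if field is not None:
                # Substitute known wildcards; re-emit unknown ones untouched.
                if field in filled:
                    parts.append(str(filled[field]))
                else:
                    parts.append("{" + field + "}")
        results.append("".join(parts))
    return results


# Hypothetical entry, shaped like something that might live in a patterns yaml.
patterns = {
    "cutadapt": "data/rnaseq_samples/{sample}/{sample}_R{n}.cutadapt.fastq.gz"
}

r1_only = partial_expand(patterns["cutadapt"], n=1)
both = partial_expand(patterns["cutadapt"], n=[1, 2])
```

Here `{sample}` stays a wildcard for snakemake to resolve per job, while `n` is filled in up front. That is roughly the job `render_r1_r2` used to do.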
Not sure yet how the lack of config object will work with the chipseq workflow, which has to do a lot of complex stuff to work out peak-calling and matching IP to input. Will need to play with that next.
Goal: reduce file count
Current system
Fastqc, fastq_screen, and average_bigwigs use a wrapper, which creates up to another 3 conda envs, which contribute 100k+ files toward total filesystem file count for each workflow.
New system
Don't use wrappers if possible; fastq_screen is getting removed due to the simplification of references; average_bigwigs is getting removed since it wasn't as helpful as I originally thought. Fastqc will be added directly as a rule.
Goal: use params as much as possible
Current system
`params:` are used infrequently and inconsistently. It's not clear when they should be used.
New system
All params that are not strictly required for the functionality of the rule have been moved to the `params:` directive, most commonly as something like `params: extra="--flags here"`. This will ensure that snakemake picks up params changes and reruns rules appropriately.
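As a toy model of why this matters: if a rule's params are recorded when a job runs, a later change to them can be detected and used to trigger a rerun, whereas flags buried in a `shell:` or `run:` block are not params and so cannot be compared this way. The sketch below only illustrates that idea under simplifying assumptions; it is not how snakemake stores or compares its metadata, and `params_fingerprint` and `needs_rerun` are hypothetical names.

```python
import hashlib
import json


def params_fingerprint(params):
    """Return a stable hash of a rule's params.

    Toy stand-in for recorded job metadata (an assumption for illustration).
    """
    # sort_keys makes the hash independent of dict insertion order.
    encoded = json.dumps(params, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def needs_rerun(recorded_fingerprint, current_params):
    """True when the current params differ from those recorded last run."""
    return recorded_fingerprint != params_fingerprint(current_params)


# Fingerprint recorded when the (hypothetical) rule last ran.
last_run = params_fingerprint({"extra": "--flags here"})

unchanged = needs_rerun(last_run, {"extra": "--flags here"})        # False
changed = needs_rerun(last_run, {"extra": "--flags here --more"})   # True
```

The same reasoning applies to moving `strand_arg` into `params:`: once the value is a param, a change to the strandedness config becomes a visible params change.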