Dynamic parallelization of rule execution #99

Open
rvosa opened this issue May 2, 2024 · 0 comments
rvosa commented May 2, 2024

As indicated in the DAG, numerous steps in the pipeline can be trivially parallelized, allowing for horizontal scaling. However, the current implementation is based on the scattergather compute model, which needs to be told ahead of time how many parallel processes will be involved. The number of processes is specified in the config file on the basis of the number of distinct taxonomic families in the input set (e.g. the order Primates has 17 families, which is entered in the config file under nfamilies and from there ends up in the Snakefile). This is an awkward step that users tend to get wrong, hence a dynamic solution where the pipeline learns the parallelization strategy from the input data would be better. However, there are some complications:

  • using the dynamic construct in recent versions of Snakemake appears to interfere with the ability to generate a DAG, which is one of the requirements for submission to WorkflowHub
  • the number of families must be learned from the input data in combination with the applicable marker gene, i.e. a simple cut | sort | uniq | wc -l (or similar) approach will be error-prone
  • within the full data set, there's a small set of families (<10) whose size may exceed the capacity of the subtree inference step, meaning that those families may have to be partitioned at subfamily or genus level, increasing the number of parallel processes
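
To make the last two points concrete, here is a minimal sketch of how a partition plan might be derived from the input data. It is not taken from the pipeline: the column names (family, genus, marker), the function name plan_partitions, the max_size threshold and the toy records are all assumptions chosen for illustration. It filters records by marker before grouping by family, and falls back to genus-level partitions for families that exceed the threshold.

```python
import csv
import io
from collections import defaultdict


def plan_partitions(rows, marker, max_size):
    """Return {partition_name: record_count} for one marker.

    Hypothetical helper: families with more than max_size records
    are split into one partition per genus.
    """
    by_family = defaultdict(list)
    for row in rows:
        # Only records for the requested marker count towards a partition.
        if row["marker"] == marker and row["family"]:
            by_family[row["family"]].append(row)

    partitions = {}
    for family, members in sorted(by_family.items()):
        if len(members) <= max_size:
            partitions[family] = len(members)
            continue
        # Oversized family: fall back to one partition per genus.
        by_genus = defaultdict(int)
        for row in members:
            by_genus[row["genus"]] += 1
        for genus, count in sorted(by_genus.items()):
            partitions[f"{family}/{genus}"] = count
    return partitions


# Made-up toy records, only to show the behaviour.
EXAMPLE_TSV = (
    "family\tgenus\tmarker\n"
    "Hominidae\tHomo\tCOI-5P\n"
    "Hominidae\tPan\tCOI-5P\n"
    "Hominidae\tPan\tCOI-5P\n"
    "Cebidae\tCebus\tCOI-5P\n"
    "Cebidae\tSaimiri\tmatK\n"
    "Lemuridae\tLemur\tmatK\n"
)

rows = list(csv.DictReader(io.StringIO(EXAMPLE_TSV), delimiter="\t"))
partitions = plan_partitions(rows, marker="COI-5P", max_size=2)
print(partitions)
```

In this toy input a naive count of distinct families gives 3, while only 2 families have records for the chosen marker, and splitting the oversized family by genus brings the number of parallel jobs back up to 3. A plan like this would still have to be wired into the workflow in a way that keeps DAG generation working, which the sketch does not address.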

This is considered 'done' when users no longer have to define scatter/gather parameters.
