
Config System YAML #9

Open
jfear opened this issue May 20, 2016 · 4 comments
jfear (Contributor) commented May 20, 2016

Starting to think about the config system YAML.

#### General Settings ####
settings: # location of system level settings
    title: My Very Cool Project
    Author: Bob
    data: /data/bob/original_data # path like settings
    python2: py2.7 # conda environment names
    env: HOME # names to access specific envs by 

#### Experiment Level Settings ####
exp.settings: # experiment level settings, settings that apply to all samples
    sampleinfo: sample_metadata.csv # Sample information relating sample specific settings to sample ids
    fastq_suffix: '.fastq.gz'  # It would be nice to be able to define a setting here that applies to all samples, or to define it per sample in the sampleinfo table in case they differ.
    annotation: # Need some way to specify which annotation to use; maybe here is not the best place.
        genic: /data/...
        transcript: /data/....
        intergenic: /data/...
    models: # add modeling information here
        formula: ~ sex + tissue + time
    factors: # tell which columns in the sample table should be treated as factors
             - sex
             - tissue
             - time

#### Workflow Settings ####
# I think using a naming scheme that follows the folder structure would be useful. For example:
# if there is a workflows folder then we would have
workflows.qc: # could define workflow specific settings
    steps_to_run: # List the pieces of the pipeline to run (or perhaps which to skip)
        - fastqc
        - rseqc
    trim: True # or could have boolean switches to change workflow behavior

workflows.align:
    aligner: 'tophat2=2.1.0' # define what software to use and optionally what version
    aggregated_output_dir: /data/...
    report_output_dir: /data/...

workflows.rnaseq: ...

workflows.references: ... 

#### Rule Specific Settings ####
rules.align.bowtie2: # rule-level settings, again with naming based on the folder structure if we need one
    cluster: # It would be nice to keep cluster settings with the rule settings; can't think of a way to get this to work, so we probably just need a separate cluster config.
        threads: 16
        mem: 60g
        walltime: 8:00:00
    index: /data/... # bowtie index prefix
    params: # Access to any parameters that need to be set
        options: -p 16 -k 8 # place to change the options
    aln_suffix: '.bt2.bam'  # place to change how files are named
    log_suffix: '.bt2.log'
daler (Contributor) commented May 21, 2016

I was thinking about this some more. I tried #10 as a way of using the github code review tools to help discussion, but figured I'd just post here.

I really like having the One True Config split by workflows. I made some mostly organizational changes to what you have above:

  • Moved the per-rule config under a "rules" key in the respective workflow, so that the nesting of the config follows the nesting of the rules within workflows. It also allows for per-workflow configuration if a rule is used in multiple places (e.g., a bowtie rule in qc and a bowtie rule in align workflow).
  • Moved the rna-seq-specific stuff (models, factors) to workflows.rnaseq.
  • Added ability to specify multiple models; how much this will be used in practice, whether we should even expose this sort of complexity rather than just build templates for custom work, or what the particular format will be, remains to be figured out.
  • Added an assembly key to the renamed global section.
  • Removed the fastq suffix; see below for discussion.
global:
  title: My Very Cool Project
  Author: Bob
  assembly: dm6
  sampleinfo: sample_metadata.csv

workflows.qc:
  rules.trim:
    adapters: adapters.fa
    extra: "-q 20"

workflows.align:
  rules:
    align:
      aligner: 'bowtie==2.0.2'
      index: /data/...
      cluster:
        threads: 16
        mem: 60g
        walltime: 8:00:00
      aln_suffix: '.bt2.bam'
      log_suffix: '.bt2.log'
      extra: "-p {threads} -k 8"

workflows.rnaseq:
  factors:
    - sex
    - tissue
    - time
  models:
    full_model: ~ sex + tissue + time
    reduced_1: ~ sex + tissue

  rules:

    featurecounts:
      annotation: /data/gene.gtf
      extra: "-s 1"

    featurecounts_intergenic:
      annotation: /data/intergenic.gtf

config lookups

Specifying so much in the config will let us write some pretty generic workflows where input, output and params are basically just a ton of config dict lookups.

rule align:
    input:
        index=config['workflows.align']['rules']['align']['index']
    threads: config['workflows.align']['rules']['align']['cluster']['threads']
   ...

Some options to think about: if we wrap the config in an object with dotted access, then it becomes slightly more readable:

rule align:
    input:
        index=config.workflows_align.rules.align.index
    threads: config.workflows_align.rules.align.cluster.threads
   ...
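A minimal sketch of such a wrapper (the class name and details are invented here, not an agreed API). It wraps rather than converts, so the underlying config stays a plain dict:

```python
class DotConfig:
    """Attribute-style read-only view over a nested dict; wraps, never converts."""

    def __init__(self, data):
        self._data = data

    def __getattr__(self, name):
        try:
            value = self._data[name]
        except KeyError:
            raise AttributeError(name)
        # Wrap nested dicts so lookups can chain: c.workflows_align.rules...
        return DotConfig(value) if isinstance(value, dict) else value


config = {
    "workflows_align": {
        "rules": {"align": {"index": "/data/index",
                            "cluster": {"threads": 16}}}
    }
}
c = DotConfig(config)
print(c.workflows_align.rules.align.cluster.threads)  # 16
```

One wrinkle: top-level keys like `workflows.align` contain dots, so attribute access would need them renamed (e.g. `workflows_align`, as above) or some escape hatch.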

Or syntax like the conda_build Metadata object,

rule align:
    input:
        index=config.get('workflows.align/rules/align/index')
    threads: config.get('workflows.align/rules/align/cluster/threads')
   ...

cluster config

I really like having the cluster config specified here alongside the rule. It could work if we provide a wrapper for calling snakemake that passes most arguments through, but extracts the cluster config info from the config file and builds a tempfile cluster_config.yaml that is passed to snakemake.

The threads configured here can be injected into the rules at the end of the workflow by iterating over workflow.rules and setting each rule.threads.
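A rough sketch of that wrapper idea. Function names are made up, and it writes JSON rather than YAML to stay stdlib-only (snakemake cluster config files can, I believe, be JSON as well); a real version would hand the tempfile to snakemake via --cluster-config:

```python
import json
import tempfile


def extract_cluster_config(config):
    """Collect per-rule 'cluster' blocks into the flat {rule: settings} shape
    a cluster config file expects."""
    cluster = {}
    for key, workflow in config.items():
        # Only walk the workflow sections of the unified config.
        if not key.startswith("workflows"):
            continue
        for rule_name, rule_cfg in workflow.get("rules", {}).items():
            if isinstance(rule_cfg, dict) and "cluster" in rule_cfg:
                cluster[rule_name] = rule_cfg["cluster"]
    return cluster


def write_cluster_config(config):
    """Dump the extracted settings to a tempfile to hand to snakemake."""
    tmp = tempfile.NamedTemporaryFile(mode="w", suffix=".json", delete=False)
    json.dump(extract_cluster_config(config), tmp, indent=2)
    tmp.close()
    return tmp.name


config = {
    "workflows.align": {
        "rules": {
            "align": {
                "index": "/data/index",
                "cluster": {"threads": 16, "mem": "60g", "walltime": "8:00:00"},
            }
        }
    }
}
print(extract_cluster_config(config))
# {'align': {'threads': 16, 'mem': '60g', 'walltime': '8:00:00'}}
```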

jfear (Contributor, Author) commented May 21, 2016

I like the re-organization; the nesting cleans things up a bit. I think the dot-notation lookups seem the cleanest.

Since the wrapper system is pulling the complexity out of the rules, I am thinking that the "workflow" should contain all of its own rules and make all of the settings some sort of lookup from the global config.

I also like having the cluster config side-by-side.

I will look at #10 and make individual comments there. Did not know you could do line-by-line comments with PRs.

daler (Contributor) commented May 21, 2016

Seems like an elegant option for the dot notation from http://stackoverflow.com/a/7534478. Given the function:

def cfg(val):
    # Walk the nested config dict along a dotted path,
    # e.g. cfg('workflows_align.rules.align.index').
    current_data = config
    for chunk in val.split('.'):
        # Missing keys fall through to an empty dict instead of raising KeyError.
        current_data = current_data.get(chunk, {})
    return current_data

the lookup becomes:

rule align:
    input:
        index=cfg('workflows_align.rules.align.index')
    threads: cfg('workflows_align.rules.align.cluster.threads')
   ...

The reason I like this is that the global config dict remains unchanged as a dict. The other answers in that stackoverflow question offer other options, but I worry about converting the global config dict to something else, in case snakemake is using it for things we don't know about that assume the full dict interface.
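A quick standalone check of the cfg lookup behavior, using underscored top-level keys to sidestep the question of dots inside key names:

```python
config = {
    "workflows_align": {
        "rules": {"align": {"index": "/data/index",
                            "cluster": {"threads": 16}}}
    }
}


def cfg(val):
    # Same lookup as above: walk the nested dict along a dotted path.
    current_data = config
    for chunk in val.split('.'):
        current_data = current_data.get(chunk, {})
    return current_data


print(cfg('workflows_align.rules.align.cluster.threads'))  # 16
print(cfg('workflows_align.rules.missing.key'))            # {} (missing keys fall through)
```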

daler (Contributor) commented May 21, 2016

Also, we really should have config validation once things settle down into a format. For example, we could keep a validation schema file that includes default values, then have code that builds an example config from that schema and validates the generated config. The user edits that config, and it is validated again before use.

Luckily I have existing code for exactly this! I'll port it over.
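In the meantime, a stdlib-only sketch of the shape such a system might take; the schema format and all function names here are invented for illustration (this is not the existing code mentioned above):

```python
# Hypothetical schema format: each leaf is a (expected_type, default_value) pair.
SCHEMA = {
    "global": {
        "title": (str, "My Project"),
        "assembly": (str, "dm6"),
    },
}


def example_config(schema):
    """Build a filled-in example config from the schema defaults."""
    out = {}
    for key, val in schema.items():
        out[key] = example_config(val) if isinstance(val, dict) else val[1]
    return out


def validate(config, schema, path=""):
    """Check required keys and types; return a list of error strings."""
    errors = []
    for key, val in schema.items():
        here = f"{path}.{key}".lstrip(".")
        if key not in config:
            errors.append(f"missing key: {here}")
        elif isinstance(val, dict):
            errors.extend(validate(config[key], val, here))
        elif not isinstance(config[key], val[0]):
            errors.append(f"wrong type for {here}: expected {val[0].__name__}")
    return errors


cfg = example_config(SCHEMA)
print(validate(cfg, SCHEMA))   # [] -- the generated defaults validate cleanly
cfg["global"]["title"] = 42
print(validate(cfg, SCHEMA))   # ['wrong type for global.title: expected str']
```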
