Fix config (#119)

* remove split fastq from config and all rules
* clean up config and fix spelling of indices
* remove test config files
* update default heatmap options
* return to config but with small changes
* refactor config.py
* fix typo in config.py
* use indices consistently for genome indices
* update config process in docs
alsmith151 · Jan 26, 2024 · 1a5f082
1 parent abee85a
commit 1a5f082

Showing 17 changed files with 157 additions and 753 deletions.
diff --git a/docs/pipeline.md b/docs/pipeline.md
@@ -9,82 +9,45 @@ The pipeline is configured using a YAML file: e.g. `config_atac.yml`, `config_ch
 The following command will generate the working directory and configuration file for the ATAC-seq pipeline:
 
 ```bash
-seqnado-config atac
+seqnado-config chip
 ```
 
 You should get somthing like this:
 
 ```bash
 $ seqnado-config chip
-  [1/23] user_name (Your name): asmith
-  [2/23] Select date
-    1 - 2024-01-13
-    Choose from [1] (1):
-  [3/23] project_name (Project name): TEST
-  [4/23] Select project_id
-    1 - test
-    Choose from [1] (1): 1
-  [5/23] genome (hg38):
-  [6/23] chromosome_sizes (/ceph/project/milne_group/shared/seqnado_reference/hg38/UCSC/sequence/hg38.chrom.sizes):
-  [7/23] indicies (/ceph/project/milne_group/shared/seqnado_reference/hg38/UCSC/bt2_index/hg38):
-  [8/23] gtf (/ceph/project/milne_group/shared/seqnado_reference/hg38/UCSC/genes/hg38.ncbiRefSeq.gtf):
-  [9/23] Select read_type
-    1 - paired
-    2 - single
-    Choose from [1/2] (1): 1
-  [10/23] Select split_fastq
-    1 - True
-    2 - False
-    Choose from [1/2] (1): 2
-  [11/23] split_fastq_parts (int):
-  [12/23] Select remove_pcr_duplicates_method
-    1 - picard
-    2 - deeptools
-    Choose from [1/2] (1): 1
-  [13/23] Select remove_blacklist
-    1 - yes
-    2 - no
-    Choose from [1/2] (1): 1
-  [14/23] blacklist (/ceph/project/milne_group/shared/seqnado_reference/hg38/hg38-blacklist.v2.bed.gz):
-  [15/23] Select make_bigwigs
-    1 - yes
-    2 - no
-    Choose from [1/2] (1): 1
-  [16/23] Select pileup_method
-    1 - deeptools
-    2 - homer
-    Choose from [1/2] (1): 1
-  [17/23] Select make_heatmaps
-    1 - yes
-    2 - no
-    Choose from [1/2] (1): 1
-  [18/23] Select call_peaks
-    1 - yes
-    2 - no
-    Choose from [1/2] (1): 1
-  [19/23] Select peak_calling_method
-    1 - macs
-    2 - lanceotron
-    3 - homer
-    Choose from [1/2/3] (1): 2
-  [20/23] Select make_ucsc_hub
-    1 - yes
-    2 - no
-    Choose from [1/2] (1): 1
-  [21/23] UCSC_hub_directory (path/to/ publically accessible location on the server): /project/milne_group/datashare/asmith/chipseq/TEST_HUB
-  [22/23] email (Email address (UCSC required)): [email protected]
-  [23/23] Select color_by
-    1 - samplename
-    2 - method
-    Choose from [1/2] (1): 1
+  What is your project name? [cchahrou_project]: TEST
+  What is your genome name? [other]: hg38
+  Path to Bowtie2 genome indices: [None]: /ceph/project/milne_group/shared/seqnado_reference/hg38/UCSC/bt2_index/hg38
+  Path to chromosome sizes file: [None]: /ceph/project/milne_group/shared/seqnado_reference/hg38/UCSC/sequence/hg38.chrom.sizes
+  Path to GTF file: [None]: /ceph/project/milne_group/shared/seqnado_reference/hg38/UCSC/genes/hg38.ncbiRefSeq.gtf
+  Path to blacklist bed file: [None]: /ceph/project/milne_group/shared/seqnado_reference/hg38/hg38-blacklist.v2.bed.gz
+  Do you want to remove blacklist regions? (yes/no) [yes]: yes
+  Remove PCR duplicates? (yes/no) [yes]: yes
+  Remove PCR duplicates method: [picard]: picard
+  Do you have spikein? (yes/no) [no]: yes
+  Normalisation method: [orlando/with_input]: orlando
+  Reference genome: [hg38]: hg38
+  Spikein genome: [dm6]: dm6
+  Path to fastqscreen config: [/ceph/project/milne_group/shared/seqnado_reference/fastqscreen_reference/fastq_screen.conf]: /ceph/project/milne_group/shared/seqnado_reference/fastqscreen_reference/fastq_screen.conf
+  Do you want to make bigwigs? (yes/no) [no]: yes
+  Pileup method: [deeptools/homer]: deeptools
+  Do you want to make heatmaps? (yes/no) [no]: yes
+  Do you want to call peaks? (yes/no) [no]: yes
+  Peak caller: [lanceotron/macs/homer]: lanceotron
+  Do you want to make a UCSC hub? (yes/no) [no]: yes
+  UCSC hub directory: [/path/to/ucsc_hub/]: /project/milne_group/datashare/etc
+  What is your email address? [[email protected]]: email for UCSC
+  Color by (for UCSC hub): [samplename]: samplename
+  Directory '2024-01-26_chip_TEST' has been created with the 'config_chip.yml' file.
 ```
 
 This will generate the following files:
 
 ```bash
-$ tree 2024-01-13_test/
+$ tree 2024-01-13_chip_test/
 
-2024-01-13_test/
+2024-01-13_chip_test/
 ├── config_chip.yml
 └── readme_test.md
 
@@ -230,6 +193,13 @@ $ ls -l
 
 ```bash
 tmux new -s NAME_OF_SESSION
+
+# or 
+
+screen -S NAME_OF_SESSION
+
+# to exit screen session
+  ctrl+a d 
 ```
 
 

diff --git a/seqnado/config.py b/seqnado/config.py
@@ -18,7 +18,6 @@ def get_user_input(prompt, default=None, is_boolean=False, choices=None):
         return user_input
 
 
-
 def setup_configuration(assay, genome, template_data):
     username = os.getenv('USER', 'unknown_user')
     today = datetime.datetime.now().strftime('%Y-%m-%d')
@@ -40,86 +39,55 @@ def setup_configuration(assay, genome, template_data):
 
     if genome == "other":
         genome = get_user_input("What is your genome name?", default="other")
-        if assay in ["chip", "atac"]:
-            genome_dict = {
-                genome: {
-                    "index": get_user_input("Path to Bowtie2 genome index:"),
-                    "chromosome_sizes": get_user_input("Path to chromosome sizes file:"),
-                    "gtf": get_user_input("Path to GTF file:"),
-                    "blacklist": get_user_input("Path to blacklist bed file:")
-                }
+        genome_dict = {
+            genome: {
+                "indices": get_user_input("Path to Bowtie2 genome indices:") if assay in ["chip", "atac"] else get_user_input("Path to STAR v2.7.10b genome indices:"),
+                "chromosome_sizes": get_user_input("Path to chromosome sizes file:"),
+                "gtf": get_user_input("Path to GTF file:"),
+                "blacklist": get_user_input("Path to blacklist bed file:")
             }
-        elif assay == "rna":
-            genome_dict = {
-                genome: {
-                    "index": get_user_input("Path to STAR v2.7.10b genome index:"),
-                    "chromosome_sizes": get_user_input("Path to chromosome sizes file:"),
-                    "gtf": get_user_input("Path to GTF file:"),
-                    "blacklist": get_user_input("Path to blacklist bed file:")
-                }
+        }
+    else:
+        if genome in genome_values:
+            genome_dict[genome] = {
+                "indices": genome_values[genome].get('bt2_indices' if assay in ["chip", "atac"] else 'star_indices', ''),
+                "chromosome_sizes": genome_values[genome].get('chromosome_sizes', ''),
+                "gtf": genome_values[genome].get('gtf', ''),
+                "blacklist": genome_values[genome].get('blacklist', '')
             }
 
-    elif genome in genome_values:
-        if assay in ["chip", "atac"]:
-            genome_dict = {
-                genome: {
-                    "index": genome_values[genome]['bt2_index'],
-                    "chromosome_sizes": genome_values[genome]['chromosome_sizes'],
-                    "gtf": genome_values[genome]['gtf'],
-                    "blacklist": genome_values[genome]['blacklist']
-                }
-            }
-        elif assay == "rna":
-            genome_dict = {
-                genome: {
-                    "index": genome_values[genome]['star_index'],
-                    "chromosome_sizes": genome_values[genome]['chromosome_sizes'],
-                    "gtf": genome_values[genome]['gtf'],
-                    "blacklist": genome_values[genome]['blacklist']
-                }
-            }
+
+    genome_config = {
+        'genome': genome,
+        'indices': genome_dict[genome]['indices'],
+        'chromosome_sizes': genome_dict[genome]['chromosome_sizes'],
+        'gtf': genome_dict[genome]['gtf'],
+    }
+    template_data.update(genome_config)
 
-    template_data['genome'] = genome
-    template_data['indicies'] = genome_dict[genome]['index']
-    template_data['chromosome_sizes'] = genome_dict[genome]['chromosome_sizes']
-    template_data['gtf'] = genome_dict[genome]['gtf']
-    template_data['read_type'] = get_user_input("What is your read type?", default="paired", choices=["paired", "single"])
 
     template_data['remove_blacklist'] = get_user_input("Do you want to remove blacklist regions? (yes/no)", default="yes", is_boolean=True)
     if template_data['remove_blacklist']:
         template_data['blacklist'] = genome_dict[genome]['blacklist']
 
-    if assay in ["chip", "atac"]:
-        template_data['remove_pcr_duplicates'] = get_user_input("Remove PCR duplicates? (yes/no)", default="yes", is_boolean=True)
-    elif assay == "rna":
-        template_data['remove_pcr_duplicates'] = get_user_input("Remove PCR duplicates? (yes/no)", default="no", is_boolean=True)
-
+    template_data['remove_pcr_duplicates'] = get_user_input("Remove PCR duplicates? (yes/no)", default= "yes" if assay in ["chip", "atac"] else "no", is_boolean=True)
     if template_data['remove_pcr_duplicates']:
         template_data['remove_pcr_duplicates_method'] = get_user_input("Remove PCR duplicates method:", default="picard", choices=["picard"])
 
     else:
         template_data['remove_pcr_duplicates_method'] = "False"
 
     if assay == "atac":
-        template_data['shift_atac_reads'] = get_user_input("Shift ATAC-seq reads? (yes/no)", default="yes", is_boolean=True)
-    elif assay in ["chip", "rna"]:
-        template_data['shift_atac_reads'] = "False"
+        template_data['shift_atac_reads'] = get_user_input("Shift ATAC-seq reads? (yes/no)", default="yes", is_boolean=True) if assay == "atac" else "False"
 
     if assay == "chip":
-        template_data['spikein'] = get_user_input("Do you have spikein? (yes/no)", default="no", is_boolean=True)
+        template_data['spikein'] = get_user_input("Do you have spikein? (yes/no)", default="no", is_boolean=True) 
         if template_data['spikein']:
                 template_data['normalisation_method'] = get_user_input("Normalisation method:", default="orlando", choices=["orlando", "with_input"])
                 template_data['reference_genome'] = get_user_input("Reference genome:", default="hg38")
                 template_data['spikein_genome'] = get_user_input("Spikein genome:", default="dm6")
                 template_data['fastq_screen_config'] = get_user_input("Path to fastqscreen config:", default="/ceph/project/milne_group/shared/seqnado_reference/fastqscreen_reference/fastq_screen.conf")
-    elif assay in ["atac", "rna"]:
-        template_data['normalisation_method'] = "False"
-
-    template_data['split_fastq'] = get_user_input("Do you want to split FASTQ files? (yes/no)", default="no", is_boolean=True)
-    if template_data['split_fastq']:
-        template_data.update['split_fastq_parts'] = get_user_input("How many parts do you want to split the FASTQ files into?", default="4")
-
-
+
     template_data['make_bigwigs'] = get_user_input("Do you want to make bigwigs? (yes/no)", default="no", is_boolean=True)
     if template_data['make_bigwigs']:
         template_data['pileup_method'] = get_user_input("Pileup method:", default="deeptools", choices=["deeptools", "homer"])
@@ -129,29 +97,16 @@ def setup_configuration(assay, genome, template_data):
         template_data['call_peaks'] = get_user_input("Do you want to call peaks? (yes/no)", default="no", is_boolean=True)
         if template_data['call_peaks']:
             template_data['peak_calling_method'] = get_user_input("Peak caller:", default="lanceotron", choices=["lanceotron", "macs", "homer"])
-
-    elif assay == "rna":
-        template_data['call_peaks'] = "False"
 
-    if assay == "rna":
-        template_data['run_deseq2'] = get_user_input("Run DESeq2? (yes/no)", default="no", is_boolean=True)
-    elif assay in ["chip", "atac"]:
-        template_data['run_deseq2'] = "False"
+    template_data['run_deseq2'] = get_user_input("Run DESeq2? (yes/no)", default="no", is_boolean=True) if assay == "rna" else "False"
 
     template_data['make_ucsc_hub'] = get_user_input("Do you want to make a UCSC hub? (yes/no)", default="no", is_boolean=True)
-    if template_data['make_ucsc_hub']:
-        template_data['UCSC_hub_directory'] = get_user_input("UCSC hub directory:", default="/path/to/ucsc_hub/")
-        template_data['email'] = get_user_input("What is your email address?", default=f"{username}@example.com")
-        template_data['color_by'] = get_user_input("Color by (for UCSC hub):", default="samplename")
-    else :
-        template_data['UCSC_hub_directory'] = "."
-        template_data['email'] = f"{username}@example.com"
-        template_data['color_by'] = "samplename"
-
-    if assay in ["chip", "atac"]:
-        template_data['options'] = TOOL_OPTIONS
-    elif assay == "rna":
-        template_data['options'] = TOOL_OPTIONS_RNA
+
+    template_data['UCSC_hub_directory'] = get_user_input("UCSC hub directory:", default="/path/to/ucsc_hub/") if template_data['make_ucsc_hub'] else "."
+    template_data['email'] = get_user_input("What is your email address?", default=f"{username}@example.com") if template_data['make_ucsc_hub'] else f"{username}@example.com"
+    template_data['color_by'] = get_user_input("Color by (for UCSC hub):", default="samplename") if template_data['make_ucsc_hub'] else "samplename"
+
+    template_data['options'] = TOOL_OPTIONS_RNA if assay == "rna" else TOOL_OPTIONS
 
 
 # Tool Specific Options
@@ -250,4 +205,3 @@ def create_config(assay, genome):
             file.write(template_deseq2.render(template_data))
 
     print(f"Directory '{dir_name}' has been created with the 'config_{assay}.yml' file.")
-
diff --git a/seqnado/utils.py b/seqnado/utils.py
@@ -109,9 +109,9 @@ def has_bowtie2_index(prefix: str) -> bool:
     path_dir = path_prefix.parent
     path_prefix_stem = path_prefix.stem
 
-    bowtie2_indicies = list(path_dir.glob(f"{path_prefix_stem}*.bt2"))
+    bowtie2_indices = list(path_dir.glob(f"{path_prefix_stem}*.bt2"))
 
-    if len(bowtie2_indicies) > 0:
+    if len(bowtie2_indices) > 0:
         return True
 
 

diff --git a/seqnado/workflow/config/config.yaml.jinja b/seqnado/workflow/config/config.yaml.jinja
@@ -10,12 +10,10 @@ design: "design.csv"
 
 genome:
     name: "{{genome}}"
-    indicies: "{{indicies}}"
+    indices: "{{indices}}"
     chromosome_sizes: "{{chromosome_sizes}}"
     gtf: "{{gtf}}"
 
-read_type: "{{read_type}}"
-
 remove_blacklist: "{{remove_blacklist}}"
 blacklist: "{{blacklist}}"
 
@@ -30,9 +28,6 @@ spikein_options:
     spikein_genome: "{{spikein_genome}}"
     fastq_screen_config: "{{fastq_screen_config}}"
 
-split_fastq: "{{split_fastq}}"
-split_fastq_parts: "{{split_fastq_parts}}"
-
 make_bigwigs: "{{make_bigwigs}}"  
 pileup_method: "{{pileup_method}}"
 make_heatmaps: "{{make_heatmaps}}"

diff --git a/seqnado/workflow/config/preset_genomes.json b/seqnado/workflow/config/preset_genomes.json
@@ -1,56 +1,56 @@
 {
     "dm6": {
-        "bt2_index": "/ceph/project/milne_group/shared/seqnado_reference/dm6/UCSC/bt2_index/dm6",
-        "star_index": "/ceph/project/milne_group/shared/seqnado_reference/dm6/UCSC/STAR_2.7.10b",
+        "bt2_indices": "/ceph/project/milne_group/shared/seqnado_reference/dm6/UCSC/bt2_index/dm6",
+        "star_indices": "/ceph/project/milne_group/shared/seqnado_reference/dm6/UCSC/STAR_2.7.10b",
         "chromosome_sizes": "/ceph/project/milne_group/shared/seqnado_reference/dm6/UCSC/sequence/dm6.chrom.sizes",
         "gtf": "/ceph/project/milne_group/shared/seqnado_reference/dm6/UCSC/genes/dm6.ncbiRefSeq.gtf",
         "blacklist": "/ceph/project/milne_group/shared/seqnado_reference/dm6/dm6-blacklist.v2.bed.gz"
     },
     "hg19": {
-        "bt2_index": "/ceph/project/milne_group/shared/seqnado_reference/hg19/UCSC/bt2_index/hg19",
-        "star_index": "/ceph/project/milne_group/shared/seqnado_reference/hg19/UCSC/STAR_2.7.10b",
+        "bt2_indices": "/ceph/project/milne_group/shared/seqnado_reference/hg19/UCSC/bt2_index/hg19",
+        "star_indices": "/ceph/project/milne_group/shared/seqnado_reference/hg19/UCSC/STAR_2.7.10b",
         "chromosome_sizes": "/ceph/project/milne_group/shared/seqnado_reference/hg19/UCSC/sequence/hg19.chrom.sizes",
         "gtf": "/ceph/project/milne_group/shared/seqnado_reference/hg19/UCSC/genes/hg19.ncbiRefSeq.gtf",
         "blacklist": "/ceph/project/milne_group/shared/seqnado_reference/hg19/hg19-blacklist.v2.bed.gz "
     },
     "hg38": {
-        "bt2_index": "/ceph/project/milne_group/shared/seqnado_reference/hg38/UCSC/bt2_index/hg38",
-        "star_index": "/ceph/project/milne_group/shared/seqnado_reference/hg38/UCSC/STAR_2.7.10b",
+        "bt2_indices": "/ceph/project/milne_group/shared/seqnado_reference/hg38/UCSC/bt2_index/hg38",
+        "star_indices": "/ceph/project/milne_group/shared/seqnado_reference/hg38/UCSC/STAR_2.7.10b",
         "chromosome_sizes": "/ceph/project/milne_group/shared/seqnado_reference/hg38/UCSC/sequence/hg38.chrom.sizes",
         "gtf": "/ceph/project/milne_group/shared/seqnado_reference/hg38/UCSC/genes/hg38.ncbiRefSeq.gtf",
         "blacklist": "/ceph/project/milne_group/shared/seqnado_reference/hg38/hg38-blacklist.v2.bed.gz"
     },
     "hg38_dm6": {
-        "bt2_index": "/ceph/project/milne_group/shared/seqnado_reference/hg38_dm6/UCSC/bt2_index/hg38_dm6",
-        "star_index": "NA",
+        "bt2_indices": "/ceph/project/milne_group/shared/seqnado_reference/hg38_dm6/UCSC/bt2_index/hg38_dm6",
+        "star_indices": "NA",
         "chromosome_sizes": "/ceph/project/milne_group/shared/seqnado_reference/hg38_dm6/UCSC/sequence/hg38_dm6.chrom.sizes",
         "gtf": "/ceph/project/milne_group/shared/seqnado_reference/hg38_dm6/UCSC/genes/hg38_dm6.ncbiRefSeq.gtf",
         "blacklist": "/ceph/project/milne_group/shared/seqnado_reference/hg38_dm6/hg38_dm6-blacklist.v2.bed.gz"
     },
     "hg38_mm39": {
-        "bt2_index": "/ceph/project/milne_group/shared/seqnado_reference/hg38_mm39/bt2_index/hg38_mm39",
-        "star_index": "NA",
+        "bt2_indices": "/ceph/project/milne_group/shared/seqnado_reference/hg38_mm39/bt2_index/hg38_mm39",
+        "star_indices": "NA",
         "chromosome_sizes": "/ceph/project/milne_group/shared/seqnado_reference/hg38_mm39/sequence/hg38_mm39.fa.fai",
         "gtf": "/ceph/project/milne_group/shared/seqnado_reference/hg38_mm39/genes/hg38_mm39.gtf",
         "blacklist": "/ceph/project/milne_group/shared/seqnado_reference/hg38_mm39/hg38_mm39-blacklist.bed.gz"
     },
     "hg38_spikein": {
-        "bt2_index": "NA",
-        "star_index": "/ceph/project/milne_group/shared/seqnado_reference/hg38_spikein/UCSC/STAR_2.7.10b",
+        "bt2_indices": "NA",
+        "star_indices": "/ceph/project/milne_group/shared/seqnado_reference/hg38_spikein/UCSC/STAR_2.7.10b",
         "chromosome_sizes": "/ceph/project/milne_group/shared/seqnado_reference/hg38_spikein/hg38_spikein.chrom.sizes",
         "gtf": "/ceph/project/milne_group/shared/seqnado_reference/hg38_spikein/UCSC/genes/hg38_spikein_transcripts.gtf",
         "blacklist": "/ceph/project/milne_group/shared/seqnado_reference/hg38/hg38-blacklist.v2.bed.gz"
     },
     "mm10": {
-        "bt2_index": "/ceph/project/milne_group/shared/seqnado_reference/mm10/UCSC/bt2_index/mm10",
-        "star_index": "/ceph/project/milne_group/shared/seqnado_reference/mm10/UCSC/STAR_2.7.10b",
+        "bt2_indices": "/ceph/project/milne_group/shared/seqnado_reference/mm10/UCSC/bt2_index/mm10",
+        "star_indices": "/ceph/project/milne_group/shared/seqnado_reference/mm10/UCSC/STAR_2.7.10b",
         "chromosome_sizes": "/ceph/project/milne_group/shared/seqnado_reference/mm10/UCSC/sequence/mm10.chrom.sizes",
         "gtf": "/ceph/project/milne_group/shared/seqnado_reference/mm10/UCSC/genes/mm10.ncbiRefSeq.gtf",
         "blacklist": "/ceph/project/milne_group/shared/seqnado_reference/mm10/mm10-blacklist.v2.bed.gz"
     },
     "mm39": {
-        "bt2_index": "/ceph/project/milne_group/shared/seqnado_reference/mm39/UCSC/bt2_index/mm39",
-        "star_index": "/ceph/project/milne_group/shared/seqnado_reference/mm39/UCSC/STAR_2.7.10b",
+        "bt2_indices": "/ceph/project/milne_group/shared/seqnado_reference/mm39/UCSC/bt2_index/mm39",
+        "star_indices": "/ceph/project/milne_group/shared/seqnado_reference/mm39/UCSC/STAR_2.7.10b",
         "chromosome_sizes": "/ceph/project/milne_group/shared/seqnado_reference/mm39/UCSC/sequence/mm39.chrom.sizes",
         "gtf": "/ceph/project/milne_group/shared/seqnado_reference/mm39/UCSC/genes/mm39.ncbiRefSeq.gtf",
         "blacklist": "/ceph/project/milne_group/shared/seqnado_reference/mm39/mm10-blacklist.v2.Liftover.mm39.bed.gz"