
Commit

Merge pull request #8 from TCLamnidis/site_dir
Version 1.0.0
TCLamnidis authored Sep 19, 2022
2 parents d9fb1f8 + e9fad0f commit d9c92c5
Showing 7 changed files with 201 additions and 86 deletions.
3 changes: 3 additions & 0 deletions .gitignore
@@ -5,3 +5,6 @@ eager_outputs/
.next*
.RData
.Rhistory
.Rproj.user
.nfs*
dev/
44 changes: 44 additions & 0 deletions CHANGELOG.md
@@ -0,0 +1,44 @@
# Autorun_eager: Changelog

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [1.0.0] - 19/09/2022

### `Added`

- Directory structure now includes a subdirectory with the site ID.
- Jobs are now submitted to `all.q`.

### `Fixed`

- Fixed a bug where bams from Autorun pipelines other than the intended one would be pulled in for processing.
- The sample names of single-stranded libraries now include the suffix `_ss` in the Sample Name field. This avoids file name collisions, makes merging of genotypes easier, and allows end users to pick between dsDNA and ssDNA genotypes for individuals where both are available.
- Library names of single-stranded libraries also include the suffix `_ss` in the Library Name field. This ensures that rows in the MultiQC report are sorted correctly.

### `Dependencies`

- nf-core/eager=2.4.5

### `Deprecated`

## [0.1.0] - 03/02/2022

Initial release of Autorun_eager.

### `Added`

- Configuration file with Autorun_eager parameter defaults in dedicated profiles for each analysis type.
- Script to prepare input TSVs from Pandora info, using Autorun output bams as input.
- Script to crawl through eager_inputs directory and run eager on each newly generated/updated input.
- cron script with the basic commands needed to run daily for full automation.

### `Fixed`

### `Dependencies`

- [sidora.core](https://github.com/sidora-tools/sidora.core)
- [pandora2eager](https://github.com/sidora-tools/pandora2eager)
- [nf-core/eager](https://github.com/nf-core/eager) `2.4.2`

### `Deprecated`
13 changes: 13 additions & 0 deletions EVA_autorun.Rproj
@@ -0,0 +1,13 @@
Version: 1.0

RestoreWorkspace: Default
SaveWorkspace: Default
AlwaysSaveHistory: Default

EnableCodeIndexing: Yes
UseSpacesForTab: Yes
NumSpacesForTab: 2
Encoding: UTF-8

RnwWeave: Sweave
LaTeX: pdfLaTeX
76 changes: 38 additions & 38 deletions README.md
@@ -1,17 +1,17 @@
# Autorun_eager

Automated nf-core/eager processing of Autorun output bams.

-# Quickstart
+## Quickstart

- Run `prepare_eager_tsv.R` for human SG or TF data for a given sequencing batch:

```bash
-prepare_eager_tsv.R -s 210429_K00233_0191_AHKJHFBBXY_Jena0014 -a SG -o eager_inputs/ -d .eva_credentials
-prepare_eager_tsv.R -s 210802_K00233_0212_BHLH3FBBXY_SRdi_JR_BN -a TF -o eager_inputs/ -d .eva_credentials
+prepare_eager_tsv.R -s <batch_Id> -a SG -o eager_inputs/ -d .eva_credentials
+prepare_eager_tsv.R -s <batch_Id> -a TF -o eager_inputs/ -d .eva_credentials
```

- Run eager with the following script, which then runs on the generated TSV files:

```bash
run_eager.sh
@@ -24,17 +24,17 @@ In such cases, an eager input TSV will still be created, but UDG treatment for a

Contains the `autorun`, `SG` and `TF` profiles.

-#### autorun
+### autorun

Broader scope options and parameters for use across all processing with autorun.

Turns off automatic cleanup of intermediate files on successful completion of a run to allow resuming of the run when additional data becomes available, without rerunning completed steps.
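
For illustration only (an editorial sketch, not text from the README): with cleanup disabled, a run for one individual can simply be resumed once new data lands in its TSV. The command below assumes the repository's `conf/Autorun.config` is supplied with `-c` and uses placeholder input/output paths:

```bash
# Hypothetical resumed invocation for a single individual's TSV.
# Because the autorun profile keeps intermediate files, -resume lets Nextflow
# reuse cached tasks instead of recomputing completed steps.
nextflow run nf-core/eager -r 2.4.5 \
  -c conf/Autorun.config \
  -profile eva,archgen,medium_data,autorun,SG \
  --input eager_inputs/SG/IND/IND001/IND001.tsv \
  --outdir eager_outputs/SG/IND/IND001 \
  -resume
```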

-#### SG
+### SG

The standardised parameters for processing human shotgun data.

-#### TF
+### TF

The standardised parameters for processing human 1240k capture data.

@@ -46,66 +46,66 @@ An R script that when given a sequencing batch ID, Autorun Analysis type and PAN
Usage: ./prepare_eager_tsv.R [options] .credentials
Options:
    -h, --help
        Show this help message and exit
    -s SEQUENCING_BATCH_ID, --sequencing_batch_id=SEQUENCING_BATCH_ID
        The Pandora sequencing batch ID to update eager input for. A TSV file will be prepared
        for each individual in this run, containing all relevant processed BAM files
        from the individual
    -a ANALYSIS_TYPE, --analysis_type=ANALYSIS_TYPE
        The analysis type to compile the data from. Should be one of: 'SG', 'TF'.
    -r, --rename
        Changes all dots (.) in the Library_ID field of the output to underscores (_).
        Some tools used in nf-core/eager will strip everything after the first dot (.)
        from the name of the input file, which can cause naming conflicts in rare cases.
    -o OUTDIR/, --outDir=OUTDIR/
        The desired output directory. Within this directory, one subdirectory will be
        created per analysis type, within that one subdirectory per individual ID,
        and one TSV within each of these directories.
    -d, --debug_output
        When provided, the entire result table for the run will be saved as '<seq_batch_ID>.results.txt'.
        Helpful to check all the output data in one place.
Note: a valid sidora .credentials file is required. Contact the Pandora/Sidora team for details.
```
The eager input TSVs will be created in the following directory structure, given `-o eager_inputs`:
-```
+```text
eager_inputs
├── SG
-│   └──IND/
+│   └──IND
│       ├── IND001
│       └── IND002
└── TF
-    └──IND/
-    ├── IND001
-    └── IND002
+    └──IND
+        ├── IND001
+        └── IND002
```
## run_eager.sh
A wrapper shell script that goes through all TSVs in the `eager_inputs` directory, checks if a completed run exists for a given TSV, and submits/resumes an
eager run for that individual if necessary.
-Currently uses eager version `2.4.2` and profiles `eva,archgen,medium_data,autorun` across all runs, with the `SG` or `TF` profiles used for their respective
+Currently uses eager version `2.4.5` and profiles `eva,archgen,medium_data,autorun` across all runs, with the `SG` or `TF` profiles used for their respective
data types.
The outputs are saved with the same directory structure as the inputs, but in a separate parent directory.
-```
+```text
eager_outputs
├── SG
-│   └──IND/
+│   └──IND
│       ├── IND001
│       └── IND002
└── TF
-    └──IND/
-    ├── IND001
-    └── IND002
+    └──IND
+        ├── IND001
+        └── IND002
```
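
An illustrative sketch of the wrapper's loop described above (not the repository's actual `run_eager.sh`; the completion check via a `multiqc` directory and the exact invocation are assumptions):

```bash
#!/usr/bin/env bash
# Sketch only: iterate over eager input TSVs, skip individuals whose run looks
# complete and up to date, and submit/resume nf-core/eager for the rest.
set -euo pipefail

for tsv in eager_inputs/*/*/*/*.tsv; do
  analysis_type=$(basename "$(dirname "$(dirname "$(dirname "$tsv")")")")  # SG or TF
  out_dir=$(dirname "eager_outputs/${tsv#eager_inputs/}")                  # mirror the input tree

  # Assumed completion heuristic: a multiqc directory newer than the input TSV.
  if [[ -d "${out_dir}/multiqc" && "${out_dir}/multiqc" -nt "$tsv" ]]; then
    continue
  fi

  nextflow run nf-core/eager -r 2.4.5 \
    -c conf/Autorun.config \
    -profile eva,archgen,medium_data,autorun,"${analysis_type}" \
    --input "$tsv" \
    --outdir "$out_dir" \
    -resume
done
```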
4 changes: 4 additions & 0 deletions conf/Autorun.config
@@ -9,6 +9,10 @@ profiles {
config_profile_contact = 'Thiseas C. Lamnidis (@TCLamnidis)'
config_profile_description = 'Autorun_eager profile for automated processing in EVA'
}

process {
queue = "all.q"
}
}

// Profile with parameters for runs using the Human_SG bams as input.
98 changes: 61 additions & 37 deletions scripts/prepare_eager_tsv.R
@@ -20,62 +20,75 @@ require(stringr)

## Validate analysis type option input
validate_analysis_type <- function(option, opt_str, value, parser) {
-valid_entries=c("TF", "SG") ## TODO comment: should this be embedded within the function? You would want to maybe update this over time no?
+valid_entries <- c("TF", "SG") ## TODO comment: should this be embedded within the function? You would want to maybe update this over time no?
ifelse(value %in% valid_entries, return(value), stop(call.=F, "\n[prepare_eager_tsv.R] error: Invalid analysis type: '", value,
"'\nAccepted values: ", paste(valid_entries,collapse=", "),"\n\n"))
}

## Save one eager input TSV per individual. Rename if necessary. Input is already subset data.
save_ind_tsv <- function(data, rename, output_dir, ...) {

## Infer Individual Id(s) from input.
-ind_id <- data %>% select(Sample_Name) %>% distinct() %>% pull()
+ind_id <- data %>% select(individual.Full_Individual_Id) %>% distinct() %>% pull()
+site_id <- substr(ind_id,1,3)

if (rename) {
data <- data %>% mutate(Library_ID=str_replace_all(Library_ID, "[.]", "_")) %>% ## Replace dots in the Library_ID to underscores.
select(Sample_Name, Library_ID, Lane, Colour_Chemistry,
SeqType, Organism, Strandedness, UDG_Treatment, R1, R2, BAM)
}

-ind_dir <- paste0(output_dir,"/",ind_id)
+ind_dir <- paste0(output_dir, "/", site_id, "/", ind_id)

if (!dir.exists(ind_dir)) {write(paste0("[prepare_eager_tsv.R]: Creating output directory '",ind_dir,"'"), stdout())}

dir.create(ind_dir, showWarnings = F, recursive = T) ## Create output directory and subdirs if they do not exist.
-readr::write_tsv(data, file=paste0(ind_dir,"/",ind_id,".tsv")) ## Output structure can be changed here.
+data %>% select(-individual.Full_Individual_Id) %>% readr::write_tsv(file=paste0(ind_dir,"/",ind_id,".tsv")) ## Output structure can be changed here.
}

## Correspondence between '-a' analysis type and the name of Kay's pipeline.
## Only bams from the output of the matching autorun_name pipeline will be included in the output
autorun_name_from_analysis_type <- function(analysis_type) {
autorun_name <- case_when(
analysis_type == "TF" ~ "HUMAN_1240K",
analysis_type == "SG" ~ "HUMAN_SHOTGUN",
## Future analyses can be added here to pull those bams for eager processing.
TRUE ~ NA_character_
)
return(autorun_name)
}

## MAIN ##

## Parse arguments ----------------------------
parser <- OptionParser(usage = "%prog [options] .credentials")
parser <- add_option(parser, c("-s", "--sequencing_batch_id"), type = 'character',
action = "store", dest = "sequencing_batch_id",
help = "The Pandora sequencing batch ID to update eager input for. A TSV file will be prepared
action = "store", dest = "sequencing_batch_id",
help = "The Pandora sequencing batch ID to update eager input for. A TSV file will be prepared
for each individual in this run, containing all relevant processed BAM files
from the individual")
parser <- add_option(parser, c("-a", "--analysis_type"), type = 'character',
action = "callback", dest = "analysis_type",
callback = validate_analysis_type, default=NA,
help = "The analysis type to compile the data from. Should be one of: 'SG', 'TF'.")
action = "callback", dest = "analysis_type",
callback = validate_analysis_type, default=NA,
help = "The analysis type to compile the data from. Should be one of: 'SG', 'TF'.")
parser <- add_option(parser, c("-r", "--rename"), type = 'logical',
action = 'store_true', dest = 'rename', default=F,
help = "Changes all dots (.) in the Library_ID field of the output to underscores (_).
action = 'store_true', dest = 'rename', default=F,
help = "Changes all dots (.) in the Library_ID field of the output to underscores (_).
Some tools used in nf-core/eager will strip everything after the first dot (.)
from the name of the input file, which can cause naming conflicts in rare cases."
)
)
parser <- add_option(parser, c("-o", "--outDir"), type = 'character',
action = "store", dest = "outdir",
help= "The desired output directory. Within this directory, one subdirectory will be
action = "store", dest = "outdir",
help= "The desired output directory. Within this directory, one subdirectory will be
created per analysis type, within that one subdirectory per individual ID,
and one TSV within each of these directory."
)
)
parser <- add_option(parser, c("-d", "--debug_output"), type = 'logical',
action = "store_true", dest = "debug", default=F,
help= "When provided, the entire result table for the run will be saved as '<seq_batch_ID>.results.txt'.
action = "store_true", dest = "debug", default=F,
help= "When provided, the entire result table for the run will be saved as '<seq_batch_ID>.results.txt'.
Helpful to check all the output data in one place."
)

arguments <- parse_args(parser, positional_arguments = 1)
opts <- arguments$options

@@ -111,7 +124,7 @@ tibble_input_iids <- complete_pandora_table %>% filter(sequencing.Batch == seque

## Pull information from pandora, keeping only matching IIDs and requested Sequencing types.
results <- inner_join(complete_pandora_table, tibble_input_iids, by=c("individual.Full_Individual_Id"="individual.Full_Individual_Id")) %>%
-filter(grepl(paste0("\\.", analysis_type), sequencing.Full_Sequencing_Id)) %>%
+filter(grepl(paste0("\\.", analysis_type), sequencing.Full_Sequencing_Id), analysis.Analysis_Id == autorun_name_from_analysis_type(analysis_type)) %>%
select(individual.Full_Individual_Id,individual.Organism,library.Full_Library_Id,library.Protocol,analysis.Result_Directory,sequencing.Sequencing_Id,sequencing.Full_Sequencing_Id,sequencing.Single_Stranded) %>%
distinct() %>% ## TODO comment: would be worrying if not already unique, maybe consider throwing a warn?
group_by(individual.Full_Individual_Id) %>%
@@ -136,24 +149,35 @@ results <- inner_join(complete_pandora_table, tibble_input_iids, by=c("individua
TRUE ~ inferred_udg
),
R1=NA,
R2=NA
) %>%
select(
"Sample_Name"=individual.Full_Individual_Id,
"Library_ID"=library.Full_Library_Id,
"Lane",
"Colour_Chemistry",
"SeqType",
"Organism"=individual.Organism,
"Strandedness",
"UDG_Treatment",
"R1",
"R2",
"BAM"
R2=NA,
## Add `_ss` to sample name for ssDNA libraries. Avoids file name collisions and allows easier merging of genotypes for end users.
Sample_Name = case_when(
sequencing.Single_Stranded == 'yes' ~ paste0(individual.Full_Individual_Id, "_ss"),
TRUE ~ individual.Full_Individual_Id
),
## Also add the suffix to the Sample_ID part of the Library_ID. This ensures that in the MultiQC report, the ssDNA libraries will be sorted after the ssDNA sample.
Library_ID = case_when(
sequencing.Single_Stranded == 'yes' ~ paste0(Sample_Name, ".", stringr::str_split_fixed(library.Full_Library_Id, "\\.", 2)[,2]),
TRUE ~ library.Full_Library_Id
)
) %>%
select(
individual.Full_Individual_Id, ## Still used for grouping, so ss and ds results of the same sample end up in the same TSV.
"Sample_Name",
"Library_ID",
"Lane",
"Colour_Chemistry",
"SeqType",
"Organism"=individual.Organism,
"Strandedness",
"UDG_Treatment",
"R1",
"R2",
"BAM"
)

## Save results into single file for debugging
if ( opts$debug ) { write_tsv(results, file=paste0(sequencing_batch_id, ".", analysis_type, ".results.txt")) }

## Group by individual IDs and save each chunk as TSV
-results %>% group_by(Sample_Name) %>% group_walk(~save_ind_tsv(., rename=F, output_dir=output_dir), .keep=T)
+results %>% group_by(individual.Full_Individual_Id) %>% group_walk(~save_ind_tsv(., rename=F, output_dir=output_dir), .keep=T)