diff --git a/.gitignore b/.gitignore
index b32e02b..d4d722d 100644
--- a/.gitignore
+++ b/.gitignore
@@ -5,3 +5,6 @@ eager_outputs/
 .next*
 .RData
 .Rhistory
+.Rproj.user
+.nfs*
+dev/
diff --git a/CHANGELOG.md b/CHANGELOG.md
new file mode 100644
index 0000000..700d5e0
--- /dev/null
+++ b/CHANGELOG.md
@@ -0,0 +1,44 @@
+# Autorun_eager: Changelog
+
+The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
+and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+
+## [1.0.0] - 19/09/2022
+
+### `Added`
+
+- Directory structure now includes a subdirectory with the site ID.
+- Jobs are now submitted to `all.q`.
+
+### `Fixed`
+
+- Fixed a bug where bams from more Autorun pipelines than intended would be pulled for processing.
+- The sample names of single-stranded libraries now include the suffix `_ss` in the Sample Name field. This avoids file name collisions, makes merging of genotypes easier, and allows end users to pick between dsDNA and ssDNA genotypes for individuals where both are available.
+- Library names of single-stranded libraries also include the suffix `_ss` in the Library Name field. This ensures that rows in the MultiQC report are sorted correctly.
+
+### `Dependencies`
+
+- nf-core/eager=2.4.5
+
+### `Deprecated`
+
+## [0.1.0] - 03/02/2022
+
+Initial release of Autorun_eager.
+
+### `Added`
+
+- Configuration file with Autorun_eager parameter defaults in dedicated profiles for each analysis type.
+- Script to prepare input TSVs from Pandora info, using Autorun output bams as input.
+- Script to crawl through the eager_inputs directory and run eager on each newly generated/updated input.
+- cron script with the basic commands needed to run daily for full automation.
+
+### `Fixed`
+
+### `Dependencies`
+
+- [sidora.core](https://github.com/sidora-tools/sidora.core)
+- [pandora2eager](https://github.com/sidora-tools/pandora2eager)
+- [nf-core/eager](https://github.com/nf-core/eager) `2.4.2`
+
+### `Deprecated`
diff --git a/EVA_autorun.Rproj b/EVA_autorun.Rproj
new file mode 100644
index 0000000..8e3c2eb
--- /dev/null
+++ b/EVA_autorun.Rproj
@@ -0,0 +1,13 @@
+Version: 1.0
+
+RestoreWorkspace: Default
+SaveWorkspace: Default
+AlwaysSaveHistory: Default
+
+EnableCodeIndexing: Yes
+UseSpacesForTab: Yes
+NumSpacesForTab: 2
+Encoding: UTF-8
+
+RnwWeave: Sweave
+LaTeX: pdfLaTeX
diff --git a/README.md b/README.md
index 7c37c70..447ba4d 100644
--- a/README.md
+++ b/README.md
@@ -1,17 +1,17 @@
 # Autorun_eager
 
-Automated nf-core/eager processing of Autorun output bams. 
+Automated nf-core/eager processing of Autorun output bams.
 
-# Quickstart
+## Quickstart
 
 - Run `prepare_eager_tsv.R` for human SG or TF data for a given sequencing batch:
 
   ```bash
-  prepare_eager_tsv.R -s 210429_K00233_0191_AHKJHFBBXY_Jena0014 -a SG -o eager_inputs/ -d .eva_credentials
-  prepare_eager_tsv.R -s 210802_K00233_0212_BHLH3FBBXY_SRdi_JR_BN -a TF -o eager_inputs/ -d .eva_credentials
+  prepare_eager_tsv.R -s <Sequencing_Batch_ID> -a SG -o eager_inputs/ -d .eva_credentials
+  prepare_eager_tsv.R -s <Sequencing_Batch_ID> -a TF -o eager_inputs/ -d .eva_credentials
   ```
 
-- Run eager with the following script, which then runs on the generated TSV files: 
+- Run eager with the following script, which then runs on the generated TSV files:
 
   ```bash
   run_eager.sh
@@ -24,17 +24,17 @@ In such cases, an eager input TSV will still be created, but UDG treatment for a
 
 Contains the `autorun`, `SG` and `TF` profiles.
 
-#### autorun
+### autorun
 
 Broader scope options and parameters for use across all processing with autorun.
 Turns off automatic cleanup of intermediate files on successful completion of a run to allow resuming of the run when additional data becomes available, without rerunning completed steps.
 
-#### SG
+### SG
 
 The standardised parameters for processing human shotgun data.
 
-#### TF
+### TF
 
 The standardised parameters for processing human 1240k capture data.
 
@@ -46,46 +46,46 @@ An R script that when given a sequencing batch ID, Autorun Analysis type and PAN
 Usage: ./prepare_eager_tsv.R [options] .credentials
 
 Options:
-        -h, --help
-                Show this help message and exit
+        -h, --help
+                Show this help message and exit
 
-        -s SEQUENCING_BATCH_ID, --sequencing_batch_id=SEQUENCING_BATCH_ID
-                The Pandora sequencing batch ID to update eager input for. A TSV file will be prepared
-                for each individual in this run, containing all relevant processed BAM files
-                from the individual
+        -s SEQUENCING_BATCH_ID, --sequencing_batch_id=SEQUENCING_BATCH_ID
+                The Pandora sequencing batch ID to update eager input for. A TSV file will be prepared
+                for each individual in this run, containing all relevant processed BAM files
+                from the individual
 
-        -a ANALYSIS_TYPE, --analysis_type=ANALYSIS_TYPE
-                The analysis type to compile the data from. Should be one of: 'SG', 'TF'.
+        -a ANALYSIS_TYPE, --analysis_type=ANALYSIS_TYPE
+                The analysis type to compile the data from. Should be one of: 'SG', 'TF'.
 
-        -r, --rename
-                Changes all dots (.) in the Library_ID field of the output to underscores (_).
-                Some tools used in nf-core/eager will strip everything after the first dot (.)
-                from the name of the input file, which can cause naming conflicts in rare cases.
+        -r, --rename
+                Changes all dots (.) in the Library_ID field of the output to underscores (_).
+                Some tools used in nf-core/eager will strip everything after the first dot (.)
+                from the name of the input file, which can cause naming conflicts in rare cases.
 
-        -o OUTDIR/, --outDir=OUTDIR/
-                The desired output directory. Within this directory, one subdirectory will be
-                created per analysis type, within that one subdirectory per individual ID,
-                and one TSV within each of these directory.
+        -o OUTDIR/, --outDir=OUTDIR/
+                The desired output directory. Within this directory, one subdirectory will be
+                created per analysis type, within that one subdirectory per individual ID,
+                and one TSV within each of these directory.
 
-        -d, --debug_output
-                When provided, the entire result table for the run will be saved as '.results.txt'.
-                Helpful to check all the output data in one place.
+        -d, --debug_output
+                When provided, the entire result table for the run will be saved as '.results.txt'.
+                Helpful to check all the output data in one place.
 
 Note: a valid sidora .credentials file is required. Contact the Pandora/Sidora team for details.
 ```
 
 The eager input TSVs will be created in the following directory structure, given `-o eager_inputs`:
 
-```
+```text
 eager_inputs
 ├── SG
-│   └──IND/
+│   └──IND
 │       ├── IND001
 │       └── IND002
 └── TF
-    └──IND/
-        ├── IND001
-        └── IND002
+    └──IND
+        ├── IND001
+        └── IND002
 ```
 
 ## run_eager.sh
 
@@ -93,19 +93,19 @@ eager_outputs
 
 A wrapper shell script that goes through all TSVs in the `eager_inputs` directory, checks if a completed run exists for a given TSV, and submits/resumes an eager run for that individual if necessary.
-Currently uses eager version `2.4.2` and profiles `eva,archgen,medium_data,autorun` across all runs, with the `SG` or `TF` profiles used for their respective
+Currently uses eager version `2.4.5` and profiles `eva,archgen,medium_data,autorun` across all runs, with the `SG` or `TF` profiles used for their respective
 data types.
 
 The outputs are saved with the same directory structure as the inputs, but in a separate parent directory.
 
-```
+```text
 eager_outputs
 ├── SG
-│   └──IND/
+│   └──IND
 │       ├── IND001
 │       └── IND002
 └── TF
-    └──IND/
-        ├── IND001
-        └── IND002
+    └──IND
+        ├── IND001
+        └── IND002
 ```
diff --git a/conf/Autorun.config b/conf/Autorun.config
index e141372..811adbd 100644
--- a/conf/Autorun.config
+++ b/conf/Autorun.config
@@ -9,6 +9,10 @@ profiles {
             config_profile_contact = 'Thiseas C. Lamnidis (@TCLamnidis)'
             config_profile_description = 'Autorun_eager profile for automated processing in EVA'
         }
+
+        process {
+            queue = "all.q"
+        }
     }
 
     // Profile with parameters for runs using the Human_SG bams as input.
diff --git a/scripts/prepare_eager_tsv.R b/scripts/prepare_eager_tsv.R
index 45ad63e..101e928 100755
--- a/scripts/prepare_eager_tsv.R
+++ b/scripts/prepare_eager_tsv.R
@@ -20,29 +20,42 @@ require(stringr)
 
 ## Validate analysis type option input
 validate_analysis_type <- function(option, opt_str, value, parser) {
-  valid_entries=c("TF", "SG") ## TODO comment: should this be embedded within the function? You would want to maybe update this over time no?
+  valid_entries <- c("TF", "SG") ## TODO comment: should this be embedded within the function? You would want to maybe update this over time no?
   ifelse(value %in% valid_entries, return(value),
     stop(call.=F, "\n[prepare_eager_tsv.R] error: Invalid analysis type: '", value,
-    "'\nAccepted values: ", paste(valid_entries,collapse=", "),"\n\n"))
+      "'\nAccepted values: ", paste(valid_entries,collapse=", "),"\n\n"))
 }
 
 ## Save one eager input TSV per individual. Rename if necessary. Input is already subset data.
 save_ind_tsv <- function(data, rename, output_dir, ...) {
   ## Infer Individual Id(s) from input.
-  ind_id <- data %>% select(Sample_Name) %>% distinct() %>% pull()
-
+  ind_id <- data %>% select(individual.Full_Individual_Id) %>% distinct() %>% pull()
+  site_id <- substr(ind_id,1,3)
+
   if (rename) {
     data <- data %>% mutate(Library_ID=str_replace_all(Library_ID, "[.]", "_")) %>% ## Replace dots in the Library_ID to underscores.
       select(Sample_Name, Library_ID, Lane, Colour_Chemistry,
-        SeqType, Organism, Strandedness, UDG_Treatment, R1, R2, BAM)
+             SeqType, Organism, Strandedness, UDG_Treatment, R1, R2, BAM)
   }
 
-  ind_dir <- paste0(output_dir,"/",ind_id)
+  ind_dir <- paste0(output_dir, "/", site_id, "/", ind_id)
 
   if (!dir.exists(ind_dir)) {write(paste0("[prepare_eager_tsv.R]: Creating output directory '",ind_dir,"'"), stdout())}
   dir.create(ind_dir, showWarnings = F, recursive = T) ## Create output directory and subdirs if they do not exist.
 
-  readr::write_tsv(data, file=paste0(ind_dir,"/",ind_id,".tsv")) ## Output structure can be changed here.
+  data %>% select(-individual.Full_Individual_Id) %>% readr::write_tsv(file=paste0(ind_dir,"/",ind_id,".tsv")) ## Output structure can be changed here.
+}
+
+## Correspondence between the '-a' analysis type and the name of Kay's pipeline.
+## Only bams produced by the corresponding autorun pipeline will be included in the output.
+autorun_name_from_analysis_type <- function(analysis_type) {
+  autorun_name <- case_when(
+    analysis_type == "TF" ~ "HUMAN_1240K",
+    analysis_type == "SG" ~ "HUMAN_SHOTGUN",
+    ## Future analyses can be added here to pull those bams for eager processing.
+    TRUE ~ NA_character_
+  )
+  return(autorun_name)
 }
 
 ## MAIN ##
@@ -50,32 +63,32 @@ save_ind_tsv <- function(data, rename, output_dir, ...) {
 
 ## Parse arguments ----------------------------
 parser <- OptionParser(usage = "%prog [options] .credentials")
 parser <- add_option(parser, c("-s", "--sequencing_batch_id"), type = 'character',
-                    action = "store", dest = "sequencing_batch_id",
-                    help = "The Pandora sequencing batch ID to update eager input for. A TSV file will be prepared
+                     action = "store", dest = "sequencing_batch_id",
+                     help = "The Pandora sequencing batch ID to update eager input for. A TSV file will be prepared
                         for each individual in this run, containing all relevant processed BAM files
                         from the individual")
 parser <- add_option(parser, c("-a", "--analysis_type"), type = 'character',
-                    action = "callback", dest = "analysis_type",
-                    callback = validate_analysis_type, default=NA,
-                    help = "The analysis type to compile the data from. Should be one of: 'SG', 'TF'.")
+                     action = "callback", dest = "analysis_type",
+                     callback = validate_analysis_type, default=NA,
+                     help = "The analysis type to compile the data from. Should be one of: 'SG', 'TF'.")
 parser <- add_option(parser, c("-r", "--rename"), type = 'logical',
-                    action = 'store_true', dest = 'rename', default=F,
-                    help = "Changes all dots (.) in the Library_ID field of the output to underscores (_).
+                     action = 'store_true', dest = 'rename', default=F,
+                     help = "Changes all dots (.) in the Library_ID field of the output to underscores (_).
                         Some tools used in nf-core/eager will strip everything after the first dot (.)
                         from the name of the input file, which can cause naming conflicts in rare cases."
-                    )
+                     )
 parser <- add_option(parser, c("-o", "--outDir"), type = 'character',
-                    action = "store", dest = "outdir",
-                    help= "The desired output directory. Within this directory, one subdirectory will be
+                     action = "store", dest = "outdir",
+                     help= "The desired output directory. Within this directory, one subdirectory will be
                         created per analysis type, within that one subdirectory per individual ID,
                         and one TSV within each of these directory."
-                    )
+                     )
 parser <- add_option(parser, c("-d", "--debug_output"), type = 'logical',
-                    action = "store_true", dest = "debug", default=F,
-                    help= "When provided, the entire result table for the run will be saved as '.results.txt'.
+                     action = "store_true", dest = "debug", default=F,
+                     help= "When provided, the entire result table for the run will be saved as '.results.txt'.
                         Helpful to check all the output data in one place."
 )
-
+
 arguments <- parse_args(parser, positional_arguments = 1)
 
 opts <- arguments$options
@@ -111,7 +124,7 @@ tibble_input_iids <- complete_pandora_table %>% filter(sequencing.Batch == seque
 
 ## Pull information from pandora, keeping only matching IIDs and requested Sequencing types.
 results <- inner_join(complete_pandora_table, tibble_input_iids, by=c("individual.Full_Individual_Id"="individual.Full_Individual_Id")) %>%
-  filter(grepl(paste0("\\.", analysis_type), sequencing.Full_Sequencing_Id)) %>%
+  filter(grepl(paste0("\\.", analysis_type), sequencing.Full_Sequencing_Id), analysis.Analysis_Id == autorun_name_from_analysis_type(analysis_type)) %>%
   select(individual.Full_Individual_Id,individual.Organism,library.Full_Library_Id,library.Protocol,analysis.Result_Directory,sequencing.Sequencing_Id,sequencing.Full_Sequencing_Id,sequencing.Single_Stranded) %>%
   distinct() %>% ## TODO comment: would be worrying if not already unique, maybe consider throwing a warn?
   group_by(individual.Full_Individual_Id) %>%
@@ -136,24 +149,35 @@ results <- inner_join(complete_pandora_table, tibble_input_iids, by=c("individua
       TRUE ~ inferred_udg
     ),
     R1=NA,
-    R2=NA
-  ) %>%
-  select(
-    "Sample_Name"=individual.Full_Individual_Id,
-    "Library_ID"=library.Full_Library_Id,
-    "Lane",
-    "Colour_Chemistry",
-    "SeqType",
-    "Organism"=individual.Organism,
-    "Strandedness",
-    "UDG_Treatment",
-    "R1",
-    "R2",
-    "BAM"
+    R2=NA,
+    ## Add `_ss` to sample name for ssDNA libraries. Avoids file name collisions and allows easier merging of genotypes for end users.
+    Sample_Name = case_when(
+      sequencing.Single_Stranded == 'yes' ~ paste0(individual.Full_Individual_Id, "_ss"),
+      TRUE ~ individual.Full_Individual_Id
+    ),
+    ## Also add the suffix to the Sample_ID part of the Library_ID. This ensures that in the MultiQC report, the ssDNA libraries will be sorted after the ssDNA sample.
+    Library_ID = case_when(
+      sequencing.Single_Stranded == 'yes' ~ paste0(Sample_Name, ".", stringr::str_split_fixed(library.Full_Library_Id, "\\.", 2)[,2]),
+      TRUE ~ library.Full_Library_Id
     )
+  ) %>%
+  select(
+    individual.Full_Individual_Id, ## Still used for grouping, so ss and ds results of the same sample end up in the same TSV.
+    "Sample_Name",
+    "Library_ID",
+    "Lane",
+    "Colour_Chemistry",
+    "SeqType",
+    "Organism"=individual.Organism,
+    "Strandedness",
+    "UDG_Treatment",
+    "R1",
+    "R2",
+    "BAM"
+  )
 
 ## Save results into single file for debugging
 if ( opts$debug ) { write_tsv(results, file=paste0(sequencing_batch_id, ".", analysis_type, ".results.txt")) }
 
 ## Group by individual IDs and save each chunk as TSV
-results %>% group_by(Sample_Name) %>% group_walk(~save_ind_tsv(., rename=F, output_dir=output_dir), .keep=T)
+results %>% group_by(individual.Full_Individual_Id) %>% group_walk(~save_ind_tsv(., rename=F, output_dir=output_dir), .keep=T)
diff --git a/scripts/run_Eager.sh b/scripts/run_Eager.sh
index 4adcef5..e5aef24 100755
--- a/scripts/run_Eager.sh
+++ b/scripts/run_Eager.sh
@@ -1,12 +1,23 @@
 #!/usr/bin/env bash
 
-nxf_path="/mnt/archgen/tools/nextflow/21.04.3.5560"
-eager_version='2.4.2'
+## Flood execution. Useful for testing/fast processing of small batches.
+if [[ $1 == "-r" || $1 == "--rush" ]]; then
+  rush="-bg"
+else
+  rush=''
+fi
+
+nxf_path="/home/srv_autoeager/conda/envs/autoeager/bin/"
+eager_version='2.4.5'
 autorun_config='/mnt/archgen/Autorun_eager/conf/Autorun.config' ## Contains specific profiles with params for each analysis type.
 root_input_dir='/mnt/archgen/Autorun_eager/eager_inputs' ## Directory should include subdirectories for each analysis type (TF/SG) and sub-subdirectories for each individual.
 #### E.g. /mnt/archgen/Autorun_eager/eager_inputs/SG/GUB001/GUB001.tsv
 root_output_dir='/mnt/archgen/Autorun_eager/eager_outputs'
 
+## Testing
+# root_input_dir='/mnt/archgen/Autorun_eager/dev/testing/eager_inputs' ## Directory should include subdirectories for each analysis type (TF/SG) and sub-subdirectories for each individual.
+# root_output_dir='/mnt/archgen/Autorun_eager/dev/testing/eager_outputs'
+
 ## Set base profiles for EVA cluster.
 nextflow_profiles="eva,archgen,medium_data,autorun"
 
@@ -20,15 +31,30 @@ for analysis_type in "SG" "TF"; do
     # echo ${analysis_type}
     analysis_profiles="${nextflow_profiles},${analysis_type}"
     # echo "${root_input_dir}/${analysis_type}"
-    for eager_input in ${root_input_dir}/${analysis_type}/*/*.tsv; do
+    for eager_input in ${root_input_dir}/${analysis_type}/*/*/*.tsv; do
         ## Set output directory name from eager input name
-        eager_output_dir="${root_output_dir}/${analysis_type}/$(basename ${eager_input} .tsv)"
-        # ## Run name is individual ID followed by analysis_type
-        # run_name="$(basename ${eager_input} .tsv)_${analysis_type}"
-        # echo $run_name
+        ind_id=$(basename ${eager_input} .tsv)
+        site_id="${ind_id:0:3}"
+        eager_output_dir="${root_output_dir}/${analysis_type}/${site_id}/${ind_id}"
+
+        run_name="-resume" ## To be changed once/if a way to give informative run names becomes available
+
+        ## TODO Give informative run names for easier tracking in tower.nf
+        ## If the output directory exists, assume you need to resume a run, else just name it
+        # if [[ -d "${eager_output_dir}" ]]; then
+        #     command_string="-resume"
+        # else
+        #     command_string="-name"
+        # fi
+        # ## Run name is individual ID followed by analysis_type. -resume or -name added as appropriate
+        # run_name="${command_string} $(basename ${eager_input} .tsv)_${analysis_type}"
+
         ## If no multiqc_report exists (last step of eager), or TSV is newer than the report, start an eager run.
         #### Always running with resume will ensure runs are only ever resumed instead of restarting.
         if [[ ${eager_input} -nt ${eager_output_dir}/multiqc/multiqc_report.html ]]; then
+
+            ## Change to input directory to run from, to keep one cwd per run.
+            cd $(dirname ${eager_input})
             ## Debugging info.
             echo "Running eager on ${eager_input}:"
             echo "${nxf_path}/nextflow run nf-core/eager \
@@ -40,12 +66,11 @@ for analysis_type in "SG" "TF"; do
             -w ${eager_output_dir}/work \
             -with-tower \
             -ansi-log false \
-            -resume"
+            ${run_name} ${rush}"
 
             ## Actually run eager now.
-            ## Email the submitting user the resulting MultiQC report.
             ## Monitor run in nf tower. Only works if TOWER_ACCESS_TOKEN is set.
-            ## TODO Maybe an EVA_Autorun account can be made for tower, to monitor runs outside of users?
+            ## Runs show in the Autorun_Eager workspace on tower.nf
             ${nxf_path}/nextflow run nf-core/eager \
             -r ${eager_version} \
             -profile ${analysis_profiles} \
@@ -55,7 +80,9 @@ for analysis_type in "SG" "TF"; do
             -w ${eager_output_dir}/work \
             -with-tower \
             -ansi-log false \
-            -resume # ${run_name}
+            ${run_name} ${rush}
+
+            cd ${root_input_dir} ## Then back to root dir
         fi
     done
done
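
For reference, the `_ss` suffix and site-ID handling introduced in `scripts/prepare_eager_tsv.R` boil down to the following minimal, self-contained R sketch. The toy tibble, its library ID values, and the `tsv_path` column are illustrative stand-ins; the real script derives these values from the full Pandora table and writes one TSV per individual via `save_ind_tsv()`.

```r
library(dplyr)
library(stringr)

## Toy stand-in for the Pandora-derived table (illustrative values only).
toy <- tibble(
  individual.Full_Individual_Id = c("GUB001", "GUB001"),
  library.Full_Library_Id       = c("GUB001.A0101", "GUB001.A0102"),
  sequencing.Single_Stranded    = c("no", "yes")
)

toy %>%
  mutate(
    ## ssDNA libraries get an '_ss' suffix on the Sample_Name, so they never collide with dsDNA outputs.
    Sample_Name = case_when(
      sequencing.Single_Stranded == "yes" ~ paste0(individual.Full_Individual_Id, "_ss"),
      TRUE ~ individual.Full_Individual_Id
    ),
    ## The Library_ID keeps its library part but inherits the (possibly suffixed) sample prefix,
    ## so ssDNA library rows sort next to their ssDNA sample row in the MultiQC report.
    Library_ID = case_when(
      sequencing.Single_Stranded == "yes" ~ paste0(Sample_Name, ".", str_split_fixed(library.Full_Library_Id, "\\.", 2)[, 2]),
      TRUE ~ library.Full_Library_Id
    ),
    ## The site ID is the first three characters of the individual ID and becomes an extra directory level.
    site_id  = substr(individual.Full_Individual_Id, 1, 3),
    tsv_path = file.path("eager_inputs", "SG", site_id, individual.Full_Individual_Id,
                         paste0(individual.Full_Individual_Id, ".tsv"))
  )
```

Both the single- and double-stranded rows keep the same `individual.Full_Individual_Id`, which is why the script now groups on that column before calling `save_ind_tsv()`: the `_ss` and non-`_ss` entries of one individual still land in the same per-individual TSV.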