
Commit

Merge pull request #8 from TCLamnidis/site_dir
Version 1.0.0
TCLamnidis authored Sep 19, 2022
2 parents d9fb1f8 + e9fad0f commit d9c92c5
Showing 7 changed files with 201 additions and 86 deletions.
3 changes: 3 additions & 0 deletions .gitignore
@@ -5,3 +5,6 @@ eager_outputs/
.next*
.RData
.Rhistory
.Rproj.user
.nfs*
dev/
44 changes: 44 additions & 0 deletions CHANGELOG.md
@@ -0,0 +1,44 @@
# Autorun_eager: Changelog

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [1.0.0] - 19/09/2022

### `Added`

- Directory structure now includes a subdirectory with the site ID.
- Jobs are now submitted to `all.q`.

### `Fixed`

- Fixed a bug where bams from Autorun pipelines other than the intended one would be pulled in for processing.
- The sample names of single-stranded libraries now include the suffix `_ss` in the Sample Name field. This avoids file name collisions, makes merging of genotypes easier, and allows end users to pick between dsDNA and ssDNA genotypes for individuals where both are available.
- Library names of single-stranded libraries also include the suffix `_ss` in the Library Name field. This ensures that rows in the MultiQC report are sorted correctly.

### `Dependencies`

- nf-core/eager=2.4.5

### `Deprecated`

## [0.1.0] - 03/02/2022

Initial release of Autorun_eager.

### `Added`

- Configuration file with Autorun_eager parameter defaults in dedicated profiles for each analysis type.
- Script to prepare input TSVs from Pandora info, using Autorun output bams as input.
- Script to crawl through eager_inputs directory and run eager on each newly generated/updated input.
- cron script with the basic commands needed to run daily for full automation.

### `Fixed`

### `Dependencies`

- [sidora.core](https://github.com/sidora-tools/sidora.core)
- [pandora2eager](https://github.com/sidora-tools/pandora2eager)
- [nf-core/eager](https://github.com/nf-core/eager) `2.4.2`

### `Deprecated`
13 changes: 13 additions & 0 deletions EVA_autorun.Rproj
@@ -0,0 +1,13 @@
Version: 1.0

RestoreWorkspace: Default
SaveWorkspace: Default
AlwaysSaveHistory: Default

EnableCodeIndexing: Yes
UseSpacesForTab: Yes
NumSpacesForTab: 2
Encoding: UTF-8

RnwWeave: Sweave
LaTeX: pdfLaTeX
76 changes: 38 additions & 38 deletions README.md
@@ -1,17 +1,17 @@
# Autorun_eager

Automated nf-core/eager processing of Autorun output bams.

-# Quickstart
+## Quickstart

- Run `prepare_eager_tsv.R` for human SG or TF data for a given sequencing batch:

```bash
-prepare_eager_tsv.R -s 210429_K00233_0191_AHKJHFBBXY_Jena0014 -a SG -o eager_inputs/ -d .eva_credentials
-prepare_eager_tsv.R -s 210802_K00233_0212_BHLH3FBBXY_SRdi_JR_BN -a TF -o eager_inputs/ -d .eva_credentials
+prepare_eager_tsv.R -s <batch_Id> -a SG -o eager_inputs/ -d .eva_credentials
+prepare_eager_tsv.R -s <batch_Id> -a TF -o eager_inputs/ -d .eva_credentials
```

- Run eager with the following script, which then runs on the generated TSV files:

```bash
run_eager.sh
@@ -24,17 +24,17 @@ In such cases, an eager input TSV will still be created, but UDG treatment for a

Contains the `autorun`, `SG` and `TF` profiles.

-#### autorun
+### autorun

Broader scope options and parameters for use across all processing with autorun.

Turns off automatic cleanup of intermediate files on successful completion of a run to allow resuming of the run when additional data becomes available, without rerunning completed steps.
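
For illustration only (an editorial sketch, not text from the README): with cleanup disabled, a run for one individual can simply be resumed once new data lands in its TSV. The command below assumes the repository's `conf/Autorun.config` is supplied with `-c` and uses placeholder input/output paths:

```bash
# Hypothetical resumed invocation for a single individual's TSV.
# Because the autorun profile keeps intermediate files, -resume lets Nextflow
# reuse cached tasks instead of recomputing completed steps.
nextflow run nf-core/eager -r 2.4.5 \
  -c conf/Autorun.config \
  -profile eva,archgen,medium_data,autorun,SG \
  --input eager_inputs/SG/IND/IND001/IND001.tsv \
  --outdir eager_outputs/SG/IND/IND001 \
  -resume
```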

-#### SG
+### SG

The standardised parameters for processing human shotgun data.

-#### TF
+### TF

The standardised parameters for processing human 1240k capture data.

@@ -46,66 +46,66 @@ An R script that when given a sequencing batch ID, Autorun Analysis type and PAN
Usage: ./prepare_eager_tsv.R [options] .credentials
Options:
    -h, --help
        Show this help message and exit
    -s SEQUENCING_BATCH_ID, --sequencing_batch_id=SEQUENCING_BATCH_ID
        The Pandora sequencing batch ID to update eager input for. A TSV file will be prepared
        for each individual in this run, containing all relevant processed BAM files
        from the individual
    -a ANALYSIS_TYPE, --analysis_type=ANALYSIS_TYPE
        The analysis type to compile the data from. Should be one of: 'SG', 'TF'.
    -r, --rename
        Changes all dots (.) in the Library_ID field of the output to underscores (_).
        Some tools used in nf-core/eager will strip everything after the first dot (.)
        from the name of the input file, which can cause naming conflicts in rare cases.
    -o OUTDIR/, --outDir=OUTDIR/
        The desired output directory. Within this directory, one subdirectory will be
        created per analysis type, within that one subdirectory per individual ID,
        and one TSV within each of these directories.
    -d, --debug_output
        When provided, the entire result table for the run will be saved as '<seq_batch_ID>.results.txt'.
        Helpful to check all the output data in one place.
Note: a valid sidora .credentials file is required. Contact the Pandora/Sidora team for details.
```
The eager input TSVs will be created in the following directory structure, given `-o eager_inputs`:
-```
+```text
eager_inputs
├── SG
-│   └──IND/
+│   └──IND
│       ├── IND001
│       └── IND002
└── TF
-    └──IND/
-    ├── IND001
-    └── IND002
+    └──IND
+        ├── IND001
+        └── IND002
```
## run_eager.sh
A wrapper shell script that goes through all TSVs in the `eager_inputs` directory, checks if a completed run exists for a given TSV, and submits/resumes an
eager run for that individual if necessary.
-Currently uses eager version `2.4.2` and profiles `eva,archgen,medium_data,autorun` across all runs, with the `SG` or `TF` profiles used for their respective
+Currently uses eager version `2.4.5` and profiles `eva,archgen,medium_data,autorun` across all runs, with the `SG` or `TF` profiles used for their respective
data types.
The outputs are saved with the same directory structure as the inputs, but in a separate parent directory.
-```
+```text
eager_outputs
├── SG
-│   └──IND/
+│   └──IND
│       ├── IND001
│       └── IND002
└── TF
-    └──IND/
-    ├── IND001
-    └── IND002
+    └──IND
+        ├── IND001
+        └── IND002
```
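
An illustrative sketch of the wrapper's loop described above (not the repository's actual `run_eager.sh`; the completion check via a `multiqc` directory and the exact invocation are assumptions):

```bash
#!/usr/bin/env bash
# Sketch only: iterate over eager input TSVs, skip individuals whose run looks
# complete and up to date, and submit/resume nf-core/eager for the rest.
set -euo pipefail

for tsv in eager_inputs/*/*/*/*.tsv; do
  analysis_type=$(basename "$(dirname "$(dirname "$(dirname "$tsv")")")")  # SG or TF
  out_dir=$(dirname "eager_outputs/${tsv#eager_inputs/}")                  # mirror the input tree

  # Assumed completion heuristic: a multiqc directory newer than the input TSV.
  if [[ -d "${out_dir}/multiqc" && "${out_dir}/multiqc" -nt "$tsv" ]]; then
    continue
  fi

  nextflow run nf-core/eager -r 2.4.5 \
    -c conf/Autorun.config \
    -profile eva,archgen,medium_data,autorun,"${analysis_type}" \
    --input "$tsv" \
    --outdir "$out_dir" \
    -resume
done
```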
4 changes: 4 additions & 0 deletions conf/Autorun.config
@@ -9,6 +9,10 @@ profiles {
config_profile_contact = 'Thiseas C. Lamnidis (@TCLamnidis)'
config_profile_description = 'Autorun_eager profile for automated processing in EVA'
}

process {
queue = "all.q"
}
}

// Profile with parameters for runs using the Human_SG bams as input.
98 changes: 61 additions & 37 deletions scripts/prepare_eager_tsv.R
@@ -20,62 +20,75 @@ require(stringr)

## Validate analysis type option input
validate_analysis_type <- function(option, opt_str, value, parser) {
-valid_entries=c("TF", "SG") ## TODO comment: should this be embedded within the function? You would want to maybe update this over time no?
+valid_entries <- c("TF", "SG") ## TODO comment: should this be embedded within the function? You would want to maybe update this over time no?
ifelse(value %in% valid_entries, return(value), stop(call.=F, "\n[prepare_eager_tsv.R] error: Invalid analysis type: '", value,
"'\nAccepted values: ", paste(valid_entries,collapse=", "),"\n\n"))
}

## Save one eager input TSV per individual. Rename if necessary. Input is already subset data.
save_ind_tsv <- function(data, rename, output_dir, ...) {

## Infer Individual Id(s) from input.
-ind_id <- data %>% select(Sample_Name) %>% distinct() %>% pull()
+ind_id <- data %>% select(individual.Full_Individual_Id) %>% distinct() %>% pull()
+site_id <- substr(ind_id,1,3)

if (rename) {
data <- data %>% mutate(Library_ID=str_replace_all(Library_ID, "[.]", "_")) %>% ## Replace dots in the Library_ID to underscores.
select(Sample_Name, Library_ID, Lane, Colour_Chemistry,
SeqType, Organism, Strandedness, UDG_Treatment, R1, R2, BAM)
}

-ind_dir <- paste0(output_dir,"/",ind_id)
+ind_dir <- paste0(output_dir, "/", site_id, "/", ind_id)

if (!dir.exists(ind_dir)) {write(paste0("[prepare_eager_tsv.R]: Creating output directory '",ind_dir,"'"), stdout())}

dir.create(ind_dir, showWarnings = F, recursive = T) ## Create output directory and subdirs if they do not exist.
-readr::write_tsv(data, file=paste0(ind_dir,"/",ind_id,".tsv")) ## Output structure can be changed here.
+data %>% select(-individual.Full_Individual_Id) %>% readr::write_tsv(file=paste0(ind_dir,"/",ind_id,".tsv")) ## Output structure can be changed here.
}

## Correspondence between '-a' analysis type and the name of Kay's pipeline.
## Only bams from the output of the matching autorun_name pipeline will be included in the output
autorun_name_from_analysis_type <- function(analysis_type) {
autorun_name <- case_when(
analysis_type == "TF" ~ "HUMAN_1240K",
analysis_type == "SG" ~ "HUMAN_SHOTGUN",
## Future analyses can be added here to pull those bams for eager processing.
TRUE ~ NA_character_
)
return(autorun_name)
}

## MAIN ##

## Parse arguments ----------------------------
parser <- OptionParser(usage = "%prog [options] .credentials")
parser <- add_option(parser, c("-s", "--sequencing_batch_id"), type = 'character',
action = "store", dest = "sequencing_batch_id",
help = "The Pandora sequencing batch ID to update eager input for. A TSV file will be prepared
action = "store", dest = "sequencing_batch_id",
help = "The Pandora sequencing batch ID to update eager input for. A TSV file will be prepared
for each individual in this run, containing all relevant processed BAM files
from the individual")
parser <- add_option(parser, c("-a", "--analysis_type"), type = 'character',
action = "callback", dest = "analysis_type",
callback = validate_analysis_type, default=NA,
help = "The analysis type to compile the data from. Should be one of: 'SG', 'TF'.")
action = "callback", dest = "analysis_type",
callback = validate_analysis_type, default=NA,
help = "The analysis type to compile the data from. Should be one of: 'SG', 'TF'.")
parser <- add_option(parser, c("-r", "--rename"), type = 'logical',
action = 'store_true', dest = 'rename', default=F,
help = "Changes all dots (.) in the Library_ID field of the output to underscores (_).
action = 'store_true', dest = 'rename', default=F,
help = "Changes all dots (.) in the Library_ID field of the output to underscores (_).
Some tools used in nf-core/eager will strip everything after the first dot (.)
from the name of the input file, which can cause naming conflicts in rare cases."
)
)
parser <- add_option(parser, c("-o", "--outDir"), type = 'character',
action = "store", dest = "outdir",
help= "The desired output directory. Within this directory, one subdirectory will be
action = "store", dest = "outdir",
help= "The desired output directory. Within this directory, one subdirectory will be
created per analysis type, within that one subdirectory per individual ID,
and one TSV within each of these directory."
)
)
parser <- add_option(parser, c("-d", "--debug_output"), type = 'logical',
action = "store_true", dest = "debug", default=F,
help= "When provided, the entire result table for the run will be saved as '<seq_batch_ID>.results.txt'.
action = "store_true", dest = "debug", default=F,
help= "When provided, the entire result table for the run will be saved as '<seq_batch_ID>.results.txt'.
Helpful to check all the output data in one place."
)

arguments <- parse_args(parser, positional_arguments = 1)
opts <- arguments$options

@@ -111,7 +124,7 @@ tibble_input_iids <- complete_pandora_table %>% filter(sequencing.Batch == seque

## Pull information from pandora, keeping only matching IIDs and requested Sequencing types.
results <- inner_join(complete_pandora_table, tibble_input_iids, by=c("individual.Full_Individual_Id"="individual.Full_Individual_Id")) %>%
-filter(grepl(paste0("\\.", analysis_type), sequencing.Full_Sequencing_Id)) %>%
+filter(grepl(paste0("\\.", analysis_type), sequencing.Full_Sequencing_Id), analysis.Analysis_Id == autorun_name_from_analysis_type(analysis_type)) %>%
select(individual.Full_Individual_Id,individual.Organism,library.Full_Library_Id,library.Protocol,analysis.Result_Directory,sequencing.Sequencing_Id,sequencing.Full_Sequencing_Id,sequencing.Single_Stranded) %>%
distinct() %>% ## TODO comment: would be worrying if not already unique, maybe consider throwing a warn?
group_by(individual.Full_Individual_Id) %>%
@@ -136,24 +149,35 @@ results <- inner_join(complete_pandora_table, tibble_input_iids, by=c("individua
TRUE ~ inferred_udg
),
R1=NA,
R2=NA
) %>%
select(
"Sample_Name"=individual.Full_Individual_Id,
"Library_ID"=library.Full_Library_Id,
"Lane",
"Colour_Chemistry",
"SeqType",
"Organism"=individual.Organism,
"Strandedness",
"UDG_Treatment",
"R1",
"R2",
"BAM"
R2=NA,
## Add `_ss` to sample name for ssDNA libraries. Avoids file name collisions and allows easier merging of genotypes for end users.
Sample_Name = case_when(
sequencing.Single_Stranded == 'yes' ~ paste0(individual.Full_Individual_Id, "_ss"),
TRUE ~ individual.Full_Individual_Id
),
## Also add the suffix to the Sample_ID part of the Library_ID. This ensures that in the MultiQC report, the ssDNA libraries will be sorted after the ssDNA sample.
Library_ID = case_when(
sequencing.Single_Stranded == 'yes' ~ paste0(Sample_Name, ".", stringr::str_split_fixed(library.Full_Library_Id, "\\.", 2)[,2]),
TRUE ~ library.Full_Library_Id
)
) %>%
select(
individual.Full_Individual_Id, ## Still used for grouping, so ss and ds results of the same sample end up in the same TSV.
"Sample_Name",
"Library_ID",
"Lane",
"Colour_Chemistry",
"SeqType",
"Organism"=individual.Organism,
"Strandedness",
"UDG_Treatment",
"R1",
"R2",
"BAM"
)

## Save results into single file for debugging
if ( opts$debug ) { write_tsv(results, file=paste0(sequencing_batch_id, ".", analysis_type, ".results.txt")) }

## Group by individual IDs and save each chunk as TSV
-results %>% group_by(Sample_Name) %>% group_walk(~save_ind_tsv(., rename=F, output_dir=output_dir), .keep=T)
+results %>% group_by(individual.Full_Individual_Id) %>% group_walk(~save_ind_tsv(., rename=F, output_dir=output_dir), .keep=T)