- Author(s): Alejandra Hernández Segura
- Organization: Rijksinstituut voor Volksgezondheid en Milieu (RIVM)
- Department: Infektieziekteonderzoek, Diagnostiek en Laboratorium Surveillance (IDS), Bacteriologie (BPD)
- Start date: 06 - 05 - 2022
The goal of this pipeline is to perform cgMLST on bacterial genomes. As input, it requires an assembly for each sample you want to analyze in the form of a single ‘.fasta’ file.
Importantly, the Juno-cgMLST pipeline also works directly on output generated from the Juno-Assembly pipeline.
The Juno-cgMLST uses ChewBBACA to find the cgMLST profile of the given genomes. Besides, the result table with the allele numbers is translated to a table with (sha1) hashes so that they can be shared and compared even if each result used a different database. Note: We highly encourage to use the hashes for the analysis instead of the allele numbers. This will make your data more reproducible in case there was an update in the database that assigns allele numbers or to share it with other colleagues.
- Linux + conda A Linux-like environment with at least 'miniconda' installed.
- Python3.7.6 .
- Clone the repository:
git clone https://github.com/RIVM-bioinformatics/juno-cgmlst.git
Alternatively, you can download it manually as a zip file (you will need to unzip it then).
- Enter the directory with the pipeline and install the main environment:
cd juno-cgmlst
conda env update -f envs/juno_cgmlst.yaml
-h, --help
Shows the help of the pipeline
-i, --input
Directory with the input (fasta) files. The fasta files should be all in this directory (no subdirectories) and have the extension '.fasta'.
-h, --help show this help message and exit
-i DIR, --input DIR Relative or absolute path to the input directory. It
must either be the output directory of the Juno-
Assembly pipeline or it must contain all the raw reads
(fastq) and assemblies (fasta) files for all samples
to be processed.
-m FILE, --metadata FILE
Relative or absolute path to the metadata csv file. If
provided, it must contain at least one column named
'sample' with the name of the sample (same than file
name but removing the suffix _R1.fastq.gz) and a
column called 'genus'. The genus provided will be used
to choose the cgMLST schema(s). If a metadata file is
provided, it will overwrite the --genus argument for
the samples present in the metadata file.
-g FILE, --genus FILE
Genus name (any genus in the metadata file will
overwrite this argument). It should be given as two
words (e.g. --genus Salmonella)
-o DIR, --output DIR Relative or absolute path to the output directory. If
non is given, an 'output' directory will be created in
the current directory.
-d DIR, --db_dir DIR Relative or absolute path to the directory that
contains the databases for all the tools used in this
pipeline or where they should be downloaded. Default
is: /mnt/db/juno/cgmlst
-c INT, --cores INT Number of cores to use. Default is 300
-q STR, --queue STR Name of the queue that the job will be submitted to if
working on a cluster.
-w INT, --time-limit INT
Time limit per job in minutes (passed as -W argument
to bsub). Jobs will be killed if not finished in this
time.
-l, --local Running pipeline locally (instead of in a computer
cluster). Default is running it in a cluster.
-u, --unlock Unlock output directory (passed to snakemake).
-n, --dryrun Dry run printing steps to be taken in the pipeline
without actually running it (passed to snakemake).
--rerunincomplete Re-run jobs if they are marked as incomplete (passed
to snakemake).
--snakemake-args [SNAKEMAKE_ARGS [SNAKEMAKE_ARGS ...]]
Extra arguments to be passed to snakemake API (https:/
/snakemake.readthedocs.io/en/stable/api_reference/snak
emake.html).
--version, -v Show Jonu_cgmlst version number and exit
Example metadata file:
sample | genus |
---|---|
sample1 | salmonella |
Note: The fastq files corresponding to this sample would probably be something like sample1_S1_R1_0001.fastq.gz and sample2_S1_R1_0001.fastq.gz and the fasta file sample1.fasta. Also note that the column titles of the metadata.csv file are all in lower case.
python juno_cgmlst.py -i [dir/to/input_directory]
python juno_cgmlst.py -i my_input_dir -o my_results_dir --db_dir my_db_dir --metadata path/to/my/metadata.csv
Note for large datasets: If your dataset is large (more than 30 samples), t is necessary to give the pipeline extra time to process samples. This means that it needs to run with the -w
or --time-limit
argument. The default is 60 (minutes) but for large datasets (more than 30 samples) it should be increased accordingly Example:
python juno_cgmlst.py -i my_large_input_dir -o my_results_dir --db_dir my_db_dir --metadata path/to/my/metadata.csv --time-limit 120
- log: Log files with output and error files from each Snakemake rule/step that is performed.
- audit_trail: Information about the versions of software and databases used.
- output per sample: The pipeline will create one subfolder per each step performed. These subfolders will in turn contain another subfolder per sample. To understand the output, please refer to the manual of ChewBBACA.
- ChewBBACA can be very slow so for large datasets it is necessary to give the pipeline extra time to process samples. This means that it needs to run with the
-w
or--time-limit
argument. The default is 60 (minutes) but for large datasets (more than 30 samples) it should be increased accordingly. - All default values have been chosen to work with the RIVM Linux environment, therefore, there might not be applicable to other environments (although they should work if the appropriate arguments/parameters are given).
- ChewBBACA has some inherent 'issues' that are known:
- The analysis of separate samples is not completely parallelized. ChewBBACA tries to optimize the analysis but many steps have to be shared between samples. ChewBBACA blocks the directory with the database so that two different analyses cannot be done using the same database at the same time. This has a reason and that is that if a new allele is found in your dataset, ChewBBACA ensures that a new allele number gets assigned to it and that this one is not concurrently assigned by another running process in the same database. However, this feature can be really problematic for databases where multiple people might be using the pipeline at the same time (as may be the case at the RIVM). You cannot put samples running in parallel (which might be only moderately more efficient anyway because of the shared steps which are long anyway) and more importantly, two people cannot be running an analysis at the same time. This includes automated runs in which samples from different genera, such as Shigella and Escherichia will be processed at the same time and both will be running the same cgMLST scheme but would interfere with each other causing the crash of one of them.
- Every time a new allele is found for a schema, it can cause some results to lose reproducibility. In these cases, the allele distance might be underestimated so this means that the ability to detect a sample as part of an outbreak would not decrease. You would, however, include more false positives. It is something that can be tolerated for our purposes but can cause confusion among analysts. It, in theory, should be improved once your database is more stable and no more novel alleles are found. The explanation of why that happens can be found here.
- Any issue can be reported in the Issues section of this repository.
- Include the hashing within the ChewBBACA algorithm.
- Build and pair a database for the alleles and loci to store the information about the sequences.
- Optimize the algorithm of ChewBBACA by not locking the whole database during the whole duration of the analysis.
This pipeline is licensed with an AGPL3 license. Detailed information can be found inside the 'LICENSE' file in this repository.
- Contact person: Roxanne Wolthuis
- Email [email protected]