9.2 more info on atpoc
atpoc takes as input VIBRANT, PhiSpy, and/or geNomad results directories (with prophage predictions) for a single sample together with a prepTG (target genomes) database to determine how conserved the sample's Phage-ome is across the genomes in the database. This could be insightful as to when a temperate phage might have integrated into a species genome and whether certain prophages are unique to certain strains.
The specific cutoffs used in fai for gene cluster detection in target genomes can be adapted as needed. Alternatively, a simple BLASTp search can be performed instead to determine all homologs of proteins for each phage from the focal sample in target genomes, regardless of whether they are similarly co-located or not. Default parameters for fai-based detection of phages are: 50% of phage genes need to be identified, in whole or fragmented along scaffold edges, via DIAMOND BLASTp at an E-value threshold of 1e-10. A syntenic similarity of 0.4 is also required. Note that some phages might be highly paralogous and atpoc might not be able to resolve them reliably, e.g. if your sample has two paralogous phages, it might report both as present in a target genome when only one is.
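As a rough sketch of how these cutoffs could be adjusted, the snippet below derives a stricter option string from the defaults and passes it through -fo, then shows the simple BLASTp mode with its default cutoffs spelled out. The flags and default values are taken from atpoc's help text; the file names are borrowed from the tutorial further down, and the value 0.7, the output directory names, and the reading of -m as the minimum proportion of phage genes that must be detected are assumptions made for illustration only.

```shell
# Default fai options as listed in atpoc's help text.
DEFAULT_FAI_OPTS="-e 1e-10 -m 0.5 -dm -sct 0.4"

# Illustrative stricter variant: raise -m from 0.5 to 0.7.
# Assumption: -m is the minimum proportion of phage genes that must be found.
STRICT_FAI_OPTS=$(printf '%s' "$DEFAULT_FAI_OPTS" | sed 's/-m 0\.5/-m 0.7/')
echo "fai options to be used: $STRICT_FAI_OPTS"

if command -v atpoc >/dev/null 2>&1; then
    # fai-based search with the stricter cutoffs (output directory name is hypothetical)
    atpoc -i Streptococcus_pyogenes_M1_GAS.fna -tg Streptococcus_Reps_prepTG_Database/ \
        -ps PhiSpy_Results/ -fo "$STRICT_FAI_OPTS" -o atpoc_Strict_Results/ -c 20

    # simple DIAMOND BLASTp search without the co-location requirement,
    # with the default identity, coverage, and E-value cutoffs written out
    atpoc -i Streptococcus_pyogenes_M1_GAS.fna -tg Streptococcus_Reps_prepTG_Database/ \
        -ps PhiSpy_Results/ -s -si 40.0 -sc 70.0 -se 1e-10 -o atpoc_SimpleBlast_Results/ -c 20
else
    echo "atpoc not found on PATH; activate zol's conda environment first." >&2
fi
```

Keeping the option string in a variable makes it easy to record exactly which cutoffs produced a given results directory.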
If fai is used for searching (the default), check out the individual fai results (in the subdirectory fai_or_blast_Results/) for each phage to see details on the conservation of individual genes. Further, follow-up analysis can be performed using zol per phage to summarize the conservation of distinct ortholog groups, evolutionary stats, and functional info.
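A minimal sketch for browsing those per-phage results is given below. It assumes that fai_or_blast_Results/ sits directly inside the atpoc output directory and holds one entry per phage prediction; the helper name list_phage_results is hypothetical and not part of zol.

```shell
# Hypothetical helper: list the top-level entries of fai_or_blast_Results/
# inside a given atpoc output directory.
list_phage_results() {
    results_dir="${1%/}/fai_or_blast_Results"
    if [ ! -d "$results_dir" ]; then
        echo "No fai_or_blast_Results/ directory found in $1" >&2
        return 1
    fi
    # Assumption: one entry per phage prediction.
    find "$results_dir" -mindepth 1 -maxdepth 1 | sort
}

# Example call using the output directory name from the tutorial below.
list_phage_results atpoc_Results/ || true
```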
By default, prodigal-gv will be used for gene calling, but you can use pyrodigal (with models for gene calling in bacteria) via the --use_pyrodigal flag. This might be more appropriate if gene calling for the target genomes was performed with default pyrodigal/prodigal instead of prodigal-gv via prepTG.
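One way to keep the gene caller consistent is to set the flag from a variable, as in the sketch below. The variable TARGET_GENE_CALLER is hypothetical and has to be set by hand from your own knowledge of how the prepTG database was built; the output directory name is likewise only an example.

```shell
# Assumption: you know which gene caller was used when the prepTG database was built.
TARGET_GENE_CALLER="pyrodigal"   # set to "prodigal-gv" to keep atpoc's default

# Only add --use_pyrodigal when the target genomes used default pyrodigal/prodigal.
CALLER_FLAG=""
if [ "$TARGET_GENE_CALLER" = "pyrodigal" ] || [ "$TARGET_GENE_CALLER" = "prodigal" ]; then
    CALLER_FLAG="--use_pyrodigal"
fi

if command -v atpoc >/dev/null 2>&1; then
    # $CALLER_FLAG is intentionally unquoted so that an empty value adds no argument.
    atpoc -i Streptococcus_pyogenes_M1_GAS.fna -tg Streptococcus_Reps_prepTG_Database/ \
        -ps PhiSpy_Results/ -o atpoc_Pyrodigal_Results/ -c 20 $CALLER_FLAG
else
    echo "atpoc not found on PATH; activate zol's conda environment first." >&2
fi
```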
We also recommend checking out PHANOTATE and Pharokka for detailed annotation of phages or obtaining better gene calls and performing more manual fai & zol analysis.
The following is a mini-tutorial on using atpoc to investigate the novelty of the phage-ome of Streptococcus pyogenes st. M1_GAS relative to representative Streptococcus genomes that we made available in a precompiled prepTG database. The focal Streptococcus pyogenes genome is the same one used as an example by PhiSpy.
First, let's download the query genome of interest from PhiSpy's git repo and convert it to FASTA format (for VIBRANT/geNomad):
# Download genome from PhiSpy's GitHub repository
wget https://raw.githubusercontent.com/linsalrob/PhiSpy/master/tests/Streptococcus_pyogenes_M1_GAS.gb
# reformat to fasta (using script available in zol)
genbankToFasta.py Streptococcus_pyogenes_M1_GAS.gb > Streptococcus_pyogenes_M1_GAS.fna
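Before moving on, it can be worth confirming that the conversion produced a non-empty FASTA file. The sketch below counts header lines with grep; the helper name count_fasta_records is hypothetical and not part of zol.

```shell
# Hypothetical helper: count FASTA records by counting lines that start with '>'.
count_fasta_records() {
    grep -c '^>' "$1"
}

FASTA=Streptococcus_pyogenes_M1_GAS.fna
if [ -s "$FASTA" ]; then
    echo "$FASTA contains $(count_fasta_records "$FASTA") sequence record(s)."
else
    echo "$FASTA is missing or empty; check the conversion step." >&2
fi
```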
Next, we can run PhiSpy, VIBRANT, and geNomad to identify phages in the focal genome:
# in some conda environment or setting with PhiSpy available
PhiSpy.py Streptococcus_pyogenes_M1_GAS.gb -o PhiSpy_Results/
# in some conda environment or setting with VIBRANT available
VIBRANT_run.py -i Streptococcus_pyogenes_M1_GAS.fna -folder VIBRANT_Results/
# in some conda environment or setting with geNomad available
genomad end-to-end Streptococcus_pyogenes_M1_GAS.fna geNomad_Results/ /path/to/genomad_dbs/
Next, we can set up the precompiled database of Streptococcus representative genomes using prepTG:
# in zol's conda environment or via the Docker wrapper:
prepTG -d Streptococcus -o Streptococcus_Reps_prepTG_Database/
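Since the atpoc run can be long, a quick pre-flight check that every input exists may save time. The sketch below uses the file and directory names from the steps above; the helper name check_atpoc_inputs is hypothetical, and it only tests that the paths exist, not that their contents are complete.

```shell
# Hypothetical helper: report any input path that does not exist.
# Returns 0 if all paths exist and 1 otherwise.
check_atpoc_inputs() {
    missing=0
    for input_path in "$@"; do
        if [ ! -e "$input_path" ]; then
            echo "Missing input: $input_path" >&2
            missing=1
        fi
    done
    return "$missing"
}

if check_atpoc_inputs Streptococcus_pyogenes_M1_GAS.fna PhiSpy_Results/ \
        VIBRANT_Results/ geNomad_Results/ Streptococcus_Reps_prepTG_Database/; then
    echo "All inputs found."
else
    echo "Some inputs are missing; see the messages above." >&2
fi
```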
Now we are ready to run atpoc!
atpoc -i Streptococcus_pyogenes_M1_GAS.fna -tg Streptococcus_Reps_prepTG_Database/ -ps PhiSpy_Results/ -vi VIBRANT_Results/ -gn geNomad_Results/ -o atpoc_Results/ -c 20
Note, this can take a while as it will involve running fai X times (where X is the number of phage predictions across all methods in the focal sample of interest).
Similar to the major results of fai and zol, atpoc primarily produces an XLSX spreadsheet. The first tab of the spreadsheet gives an overview of the focal sample's prophage predictions from the different software.
The second tab shows the coverage of the focal sample's phage-ome across the genomes in the target genomes database.
usage: atpoc [-h] -i SAMPLE_GENOME [-vi VIBRANT_RESULTS] [-ps PHISPY_RESULTS] [-gn GENOMAD_RESULTS] -tg TARGET_GENOMES_DB [-up] [-fo FAI_OPTIONS] [-s] [-si SIMPLE_BLASTP_IDENTITY_CUTOFF]
[-sc SIMPLE_BLASTP_COVERAGE_CUTOFF] [-se SIMPLE_BLASTP_EVALUE_CUTOFF] [-sm SIMPLE_BLASTP_SENSITIVITY_MODE] -o OUTDIR [-c CPUS]
Program: atpoc
Author: Rauf Salamzade
Affiliation: Kalan Lab, UW Madison, Department of Medical Microbiology and Immunology
atpoc - Assess Temperate Phage-Ome Conservation
atpoc wraps fai to assess the conservation of a sample's integrated/temperate phage-ome
relative to a set of target genomes (e.g. genomes belonging to the same genus). Alternatively,
it can run a simple DIAMOND BLASTp analysis to just assess the presence of prophage genes
individually - without the requirement they are co-located like in the focal sample.
options:
-h, --help show this help message and exit
-i SAMPLE_GENOME, --sample_genome SAMPLE_GENOME
Path to sample genome in GenBank or FASTA format.
-vi VIBRANT_RESULTS, --vibrant_results VIBRANT_RESULTS
Path to VIBRANT results directory for a single sample/genome.
-ps PHISPY_RESULTS, --phispy_results PHISPY_RESULTS
Path to PhiSpy results directory for a single sample/genome.
-gn GENOMAD_RESULTS, --genomad_results GENOMAD_RESULTS
Path to GeNomad results directory for a single sample/genome.
-tg TARGET_GENOMES_DB, --target_genomes_db TARGET_GENOMES_DB
prepTG database directory for target genomes of interest.
-up, --use_pyrodigal Use default pyrodigal instead of prodigal-gv to call genes in
phage regions to use as queries in fai/simple-blast. This
is perhaps preferable if target genomes db was created with default pyrodigal/prodigal.
-fo FAI_OPTIONS, --fai_options FAI_OPTIONS
Provide fai options to run. Should be surrounded by quotes. [Default is "-e 1e-10 -m 0.5 -dm -sct 0.4"]
-s, --use_simple_blastp
Use a simple DIAMOND BLASTp search with no requirement for co-localization of hits.
-si SIMPLE_BLASTP_IDENTITY_CUTOFF, --simple_blastp_identity_cutoff SIMPLE_BLASTP_IDENTITY_CUTOFF
If simple BLASTp mode requested : cutoff for identity between query proteins and matches in target genomes [Default is 40.0].
-sc SIMPLE_BLASTP_COVERAGE_CUTOFF, --simple_blastp_coverage_cutoff SIMPLE_BLASTP_COVERAGE_CUTOFF
If simple BLASTp mode requested : cutoff for coverage between query proteins and matches in target genomes [Default is 70.0].
-se SIMPLE_BLASTP_EVALUE_CUTOFF, --simple_blastp_evalue_cutoff SIMPLE_BLASTP_EVALUE_CUTOFF
If simple BLASTp mode requested : cutoff for E-value between query proteins and matches in target genomes [Default is 1e-10].
-sm SIMPLE_BLASTP_SENSITIVITY_MODE, --simple_blastp_sensitivity_mode SIMPLE_BLASTP_SENSITIVITY_MODE
Sensitivity mode for DIAMOND BLASTp. [Default is "very-sensitive"].
-o OUTDIR, --outdir OUTDIR
Output directory.
-c CPUS, --cpus CPUS The number of CPUs to use.