-
Notifications
You must be signed in to change notification settings - Fork 4
5. tutorial ‐ a detailed walkthrough
This tutorial will simply walkthrough the test commands in the run_tests.sh
script to test proper installation and showcase various features in the suite. The test data pertains to the enterococcal polysaccharide antigen encoding locus epa in Enterococcus faecalis and Enterococcus faecium. We perform a more thorough examination of this gene cluster in the zol manuscript.
Lets begin by downloading the test dataset:
wget https://github.com/Kalan-Lab/zol/raw/main/test_case.tar.gz
Next, we can uncompress it and change directories into it:
tar -zxvf test_case.tar.gz
cd test_case/
Before we use fai to identify homologous or orthologous instances of gene clusters in a set of target genomes, we must format the target genomes using prepTG.
prepTG takes a directory of genomic assemblies as input, in either FASTA or GenBank format. If GenBank format is provided, the expectation is that CDS features are included (in other words it is a full GenBank). If FASTA files are provided the default operation is to use pyrodigal for bacterial gene calling, however, options exist to use prodigal or minimap instead. minimap was incorporated specifically for eukaryotic genomes provided in FASTA format, where users can also provide a predicted proteome file (another FASTA file with protein sequences) of a reference genome to map those to the remainder of the genomes. We took such an approach in the zol manuscript to map high-quality coding sequence predictions from a reference Aspergillus flavus genome to the remainder of Aspergillus flavus genomes available in NCBI but lacking coding sequence predictions. For prodigal and pyrodigal, usage metagenomics gene calling mode is also available as a configurable option.
Our target genomes, where we will be searching for the epa locus can be found in the subdirectory: Target_Genomes/
. Because we are dealing with bacterial genomes - we can simply run prepTG with default settings as such:
prepTG -i Target_Genomes/ -o prepTG_Database/ -c 4 -cst
The -c
option in this case controls the number of parallel threads to use.
The
-cst
option (new in v1.3.7+) creates a species tree of the target genomes from skani-based ANI estimates (works best for within a species or less diverse genus) + neighbor-joining. This species tree can then be used for visualization purposes downstream in fai. The resulting file will be in the prepTG database folder and namedSpecies_Tree.nwk
.
This might take a while, but we have uploaded premade databases of representative genomes for select bacterial taxa (see the premade prepTG dbs wiki page for further info).
Starting in version 1.3.7 of the suite, users can simply issue the command:
prepTG -d Enterococcus -o prepTG_Database/
Note, databases are stored on Zenodo. Option should generally work, but download speeds might be slow at times.
Users should also be familiar with the ever so useful ncbi-genome-download
program which is included automatically in the zol conda environment.
GitHub link: https://github.com/kblin/ncbi-genome-download
# Example for downloading all Enterococcus genomes in RefSeq in FASTA format
ncbi-genome-download -F fasta -s refseq -g "Enterococcus" --flat-output -o RefSeq_Enterococcus_Genomes/ bacteria -p 4
fai accepts query gene clusters in multiple formats. For a description of these please see the basic usage examples wiki page.
In the test dataset, we already have a known instance of the epa gene cluster from E. faecalis provided as a protein FASTA (Epa_Proteins_from_MIBiG_GenBank.faa
) and in GenBank format (Epa_MIBiG_GBK/Epa_MIBiG_GenBank.gbk
).
We could also manually assemble a protein FASTA file from literature or online database such as NCBI or KEGG. We could also download the GenBank for the gene cluster directly from MIBiG using wget: wget https://mibig.secondarymetabolites.org/repository/BGC0000792/BGC0000792.gbk
. Alternatively, we could provide a coordinate for the epa along some reference genome, e.g. the E. faecalis* V583 genome (also provided in the workspace, Efaecalis_V583_Genome.fasta
). This enables users to
Here is a quick look at how using these options might look.
Using a query provided as a FASTA file of proteins
fai -pq Epa_Proteins_from_MIBiG_GenBank.faa -tg prepTG_Database/ -o fai_Results/
A special case is using a single gene as a query via the
-sq
option - which gains inspiration from CORASON/EvoMining - more support for this option might be developed in the future.
Using a query provided as a GenBank file
fai -i Epa_MIBiG_GBK/Epa_MIBiG_GenBank.gbk -tg prepTG_Database/ -o fai_Results/
Using coordinates along a reference genome
fai -r Efaecalis_V583_Genome.fasta -rc NC_004668.1 -rs 2083902 -re 2115174 -tg prepTG_Database/ -o fai_Results/
Where are the results? 💥
The set of homologous gene cluster instances identified in target genomes will be located in the subdirectory: fai_Results/Homologous_GenBanks_Directory/
Here is a quick overview of key options in fai a user should consider. Additional details around the algorithms fai uses to delineate gene can be found on the more info on fai wiki page.
-
-dm
or--draft_mode
: specifying this option will enable draft-assembly mode searching, which allows for looser search criteria and assumptions that the gene cluster might be split up across multiple scaffolds/contigs. False positives are also likely to be incurred. -
-fp
or--filter_paralogs
: specifying this option will specify to filter secondary hits (which might be paralogous instances of the gene cluster) if multiple, overlapping hits for the gene cluster are found in individual target genomes. -
-e
or--evalue
: The e-value cutoff for whether to consider a gene in a target genome exhibits similarity to a protein from the query gene cluster using DIAMOND blastp. Default value is 1e-10. -
-m
or--min_prop
: The minimum proportion of genes from the query gene cluster needed to report a homologous instance of a query gene cluster. Note genes are actually collapsed into distinct alleles upfront, so this is the minimum proportion of distinct alleles. Default value is 0.5. -
-mgd
or--max_genes_disconnect
: The maximum number of CDS features separating two genes in a target genome which exhibit sufficient similarity to a protein from the query gene cluster for them to be considered as part of the same cluster instance. Default is 5. -
-kpq
or--key_protein_queries
: A FASTA protein file can be provided with individual proteins that are special and that users can specify separate e-value or conservation thresholds for gene cluster detection using (via the-kpe
and-kpm
options). E.g. are you looking for an NRPS-type BGC and there are three key NRPS genes, then you can specify them separately and make your search more stringent while loosening up criteria for detection of auxiliary gene cluster components. -
-sct
or--synteic_correlation_threshold
: If a GenBank or coordinates along a reference genome are provided, a syntenic similarity assessment between detected gene clusters and the query gene cluster will be performed based on global correlation gene order similarity. Values closer to 1 are more stringent whereas a value of 0 implies no syntenic filter should be applied. The default value is 0.6. -
-gdm
or--gc_delineation_mode
: The gene cluster delineation mode. There are basically two options, for most users we recommend the default setting. -
-f
or--flanking_context
: The flanking of a homologous instance of the gene cluster identified in a target genome to include in the resulting GenBank output, useful to explore conservation and annotation of surrounding contexts of the gene cluster downstream in zol. Default is 1000 bp. -
-gp
or--generate_plots
: Whether to generate a PDF with plots showcasing the similarity of detected homologous gene cluster instances from target genomes to the query gene cluster. Looks like the following, each bar is a gene along the target genome, the height corresponds to the ratio of the protein in the target genome to the best matching protein from the query gene cluster. The color corresponds to the percent identity (more red = higher identity).
-
-st
or--species_tree
: providing this option will allow using a species tree of the target genomes to construct a tree-heatmap figure showing whether the presence of the query gene cluster is widespread or clade specific. It will look something like the following, with darker values indicating a higher bitscore between the query and target proteins:
Check out cblaster and CAGECAT as alternatives to fai for finding homologous instances of a query gene cluster
zol just takes a set of GenBanks as input and it is definitely possible to use cblaster by Gilchrist et al. 2021 instead of fai to gather gene-clusters in GenBank format; for instance:
# use cblaster to search for homologous co-clusters in NCBI genomes
cblaster search -qf queries.faa -s cblaster_results.json
# use cblaster to extract GenBannks of homologous gene-clusters detected
cblaster extract_clusters session.json -o example_directory/
CAGECAT by van den Belt et al. 2023 is an awesome webserver for running cblaster and clinker.
cblaster features the fantastic ability to use NCBI's computational infrastructure to perform alignment thereby minimizing the need for local resources.
Step 4: Generate a tabular report with evolutionary trends, annotation info, and conservation stats using zol 📃 📈
Finally, we can investigate the relationships between identified homologous instances of the query gene cluster using zol.
We can do this using the following basic command:
zol -i fai_Results/Homologous_GenBanks_Directory/ -o zol_Results/ -c 10
Once more -c
is just specifying the number of threads to use.
I highly recommend using the option -cd
or --custom_database
to provide proteins from the query gene cluster used for fai as an extra database for annotation. This will map these proteins to the ortholog groups so you have a point of reference to the original query gene cluster.
zol -i fai_Results/Homologous_GenBanks_Directory/ -cd Epa_Proteins_from_MIBiG_GenBank.faa -o zol_Results/ -c 10
In the final Excel spreadsheet generated, one of the columns will now feature annotations from the custom database (protein FASTA file) provided.
Where are the results? 💥
The final results, once zol is done, can be found in the subdirectory at zol_Results/Final_Results/
with the major results file being the XLSX spreadsheet Consolidated_Report.xlsx
.
Here are a list and brief description of key options to consider in zol:
-
it
oridentity_threshold
,ct
orcoverage_threshold
, andet
orevalue_threshold
: These are key thresholds used for InParnoid-type ortholog grouping in zol. The default values might be appropriate for some gene clusters being studied but looser or more stringent criteria might be beneficial to your analysis for other gene clusters. -
-fl
or--filter_low_quality
: Filter out gene clusters which feature alot of missing bases (>10%). I almost always issue this, but it is not turned on by default. -
-fd
or--filter_draft_quality
: Filter out gene clusters which were found near scaffold edges. I also usually specify this, especially when interested in gene conservation. This is because it becomes tricky to assess whether a gene is missing simply because the gene cluster is fragmented due to assembly issues or because it is actually missing. If fai is used in "draft mode", users could provide multi-record GenBanks and choose to skip this argument, I have not tested the use of multi-record GenBanks much. -
-r
or--rename_lt
: If CDS features in input GenBanks don't havelocus_tag
identifiers, just generate them from scratch. -
-d
or--dereplicate
: Dereplicate gene clusters to remove highly similar versions. Can be used to reduce the complexity of the analysis. It is critical to use dereplication if you are dealing with thousands of genomes because otherwise orthology finding will take forever!!!. -
-ri
or--reinflate
: So how do you run zol with thousands of gene clusters? You use this flag, which tells zol to reinflate the ortholog groups determined using a dereplicated/represntative set of gene clusters with proteins from the full set of gene clusters provided as input. Note, this reinflation currently works for grouping proteins which are very similar to an already (ortholog) grouped protein from the representative gene clusters. dereplication must be specified alongside this option.
A key feature of zol is the option to dereplicate gene clusters.