4: Parsing Loci
SuperCRUNCH can be used to find sequences based on a list of taxon names and a list of locus search terms, and build individual fasta files for each locus included. These fasta files are the targets for all downstream steps, including similarity searches, sequence selection, and alignment. The taxon list is covered in great detail in the Assessing Taxonomy section, including how searches work with and without subspecies, whereas the locus search terms are covered on this page. Recommendations for best practices to find loci are provided here, and complete instructions for parsing loci using the Parse_Loci.py module can also be found here.
New in v1.3.0: A negative search term can now be included in the search terms file, in addition to the typical abbreviation and description terms.
SuperCRUNCH requires a list of loci and associated search terms to initially identify the contents of sequence records. For each locus included in the list, SuperCRUNCH will search for the associated gene abbreviations and descriptions in each sequence record. The choice of loci to include will inherently be group-specific, and surveys of phylogenetic and phylogeographic papers may help to identify an appropriate marker set. There is no limit to the number of loci, and SuperCRUNCH can also be used to search for large genomic data sets available on the NCBI nucleotide database, such as those obtained through sequence capture experiments (UCEs, anchored enrichment, etc.). For detailed instructions on searching for UCEs, see the section below.
The format of the locus text file is tab-delimited. Prior to v1.3.0, this was a three-column text file. Starting with v1.3.0, a fourth column should be included that contains negative search terms. This new feature is backwards compatible: the three-column file will still work, but it is not recommended. For a three-column file, the default null value for the negative search term (N/A) is automatically generated for each gene included. When creating a new search terms file, you should only use the four-column format.
- Column One: The first column contains the locus name that will be used to label output files. It must not contain any spaces or special characters (including underscores and dashes), and should be kept short and simple.
- Column Two: The second column contains the known abbreviation(s) for the gene or marker. Abbreviations should not include any spaces, but can contain numbers and other characters (like dashes). The second column can contain multiple abbreviations, which should be separated by a semi-colon, with no extra spaces between abbreviations.
- Column Three: The third column contains a longer label of the gene or marker, such as its full name or description. This third column can also contain multiple search entries, which also should be separated by a semi-colon with no extra spaces between label entries. Because of the way searching works, the locus description terms should not be distinguished by any of the following characters: "," ";" ":" "(" or ")". They can contain them, but these characters will be automatically removed from the descriptions when they are read by the program. This character removal also happens to the sequence description lines.
- Column Four: The fourth column contains any negative search terms. If these terms are found in any record, that record will be excluded, regardless of whether it matches the abbreviation and description terms provided in columns two and three. If no negative term is desired, the null value of N/A should be provided in this column.
The abbreviations and labels are not case-specific, as they are converted to uppercase during actual searches, along with the description lines of the sequences.
The success of finding loci depends on defining appropriate locus abbreviations and labels. Examples of how searches operate can be found below, which should help guide how to select good search terms. For any locus, it is a good idea to search for the locus on GenBank and examine several records to identify the common ways it is labeled.
Here is an example of the formatting for a locus search terms file containing three genes to search for. The columns are as follows (a header should not be included in the actual file).
Label Abbreviation Description Negative
Terms:
CMOS    CMOS;C-MOS    oocyte maturation factor    pseudogene
EXPH5    EXPH5    exophilin;exophilin 5;exophilin-5;exophilin protein 5    N/A
PTPN    PTPN;PTPN12    protein tyrosine phosphatase;tyrosine phosphatase non-receptor type 12    N/A
In this example:
- CMOS contains two abbreviations, one description search term, and one negative search term.
- EXPH5 contains one abbreviation, four description search terms, and a null value for the negative search term (no negative term is used).
- PTPN contains two abbreviations, two description search terms, and a null value for the negative search term (no negative term is used).
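The file format described above can be read with a few lines of Python. The following is a minimal sketch under stated assumptions, not the code used by Parse_Loci.py: the names parse_locus_terms and LocusTerms are hypothetical, and the handling of three-column lines simply mirrors the backwards-compatible behavior described earlier (a missing fourth column becomes N/A).

```python
from typing import NamedTuple


class LocusTerms(NamedTuple):
    """Hypothetical container for one line of the search terms file."""
    label: str
    abbreviations: list
    descriptions: list
    negative: str


def parse_locus_terms(text):
    """Parse tab-delimited locus search terms (three or four columns per line)."""
    loci = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        cols = [col.strip() for col in line.split("\t")]
        if len(cols) == 3:
            cols.append("N/A")  # older three-column format: use the null value
        if len(cols) != 4:
            raise ValueError(f"expected 3 or 4 tab-delimited columns: {line!r}")
        label, abbrevs, descs, negative = cols
        loci.append(LocusTerms(
            label=label,
            # multiple entries are separated by semi-colons; terms are uppercased
            abbreviations=[a.upper() for a in abbrevs.split(";")],
            descriptions=[d.upper() for d in descs.split(";")],
            negative=negative if negative == "N/A" else negative.upper(),
        ))
    return loci


example = (
    "CMOS\tCMOS;C-MOS\toocyte maturation factor\tpseudogene\n"
    "EXPH5\tEXPH5\texophilin;exophilin 5;exophilin-5;exophilin protein 5\tN/A\n"
    # three-column line, as allowed for files made before v1.3.0
    "PTPN\tPTPN;PTPN12\tprotein tyrosine phosphatase;tyrosine phosphatase non-receptor type 12\n"
)

loci = parse_locus_terms(example)
for locus in loci:
    print(locus.label, len(locus.abbreviations), len(locus.descriptions), locus.negative)
```

Running a check like this on a new search terms file is a quick way to catch stray spaces or a missing tab before starting a full search.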
Searches for loci are conducted using the Parse_Loci.py module. The sequence records are parsed to obtain metadata that is entered into an SQL database. Among other information, the description line (minus accession and taxon name) is entered as one column in the database. Before it is entered, the description line is converted to uppercase and stripped of the following characters: "," ";" ":" "(" and ")". Here is an example of how the descriptions of four sequence records are transformed before they are entered into the SQL database.
Initial records:
>JX969518.1 Phymaturus dorsimaculatus isolate LJAMM-CNP 983 oocyte maturation factor (c-mos) gene, partial cds
>AF154147.1 Sceloporus jarrovii oberon strain 29a 12S ribosomal RNA gene, partial sequence; mitochondrial gene for mitochondrial product
>AM055650.1 Rieppeleon brachyurus mitochondrial partial 12S rRNA gene, specimen voucher MHNG 2624.082
>HF570740.1 Calumma brevicornis partial RAG1 gene for recombination activating protein 1, specimen voucher ZSM:549/2001
Transformed description components:
ISOLATE LJAMM-CNP 983 OOCYTE MATURATION FACTOR C-MOS GENE PARTIAL CDS
OBERON STRAIN 29A 12S RIBOSOMAL RNA GENE PARTIAL SEQUENCE MITOCHONDRIAL GENE FOR MITOCHONDRIAL PRODUCT
MITOCHONDRIAL PARTIAL 12S RRNA GENE SPECIMEN VOUCHER MHNG 2624.082
PARTIAL RAG1 GENE FOR RECOMBINATION ACTIVATING PROTEIN 1 SPECIMEN VOUCHER ZSM549/2001
Note that the characters "," ";" ":" "(" and ")" have been removed, and the lines are in uppercase.
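The transformation is simple enough to reproduce for checking your own records. This is a sketch of the two steps described above (character removal and uppercasing), not the module's actual code; normalize_description is a hypothetical name, and the input is assumed to already have the accession number and taxon name removed.

```python
def normalize_description(description):
    """Uppercase a description and drop the characters , ; : ( and )."""
    removed = str.maketrans("", "", ",;:()")
    return description.translate(removed).upper()


raw = "partial RAG1 gene for recombination activating protein 1, specimen voucher ZSM:549/2001"
print(normalize_description(raw))
```

Applying the same function to your search terms shows exactly what will be compared against the description lines.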
For each locus, a search is conducted for every abbreviation and gene description included. Let's take one gene as an example, and assume this is the contents of the locus file:
Label Abbreviation Description Negative
Terms:
CMOS    CMOS;C-MOS    oocyte maturation factor    N/A
This would result in a search for the following:
Abbreviations:
' CMOS '
' C-MOS '
Description:
'OOCYTE MATURATION FACTOR'
Negative:
N/A
Note that the search terms are also converted to uppercase, and that the abbreviations receive an extra space character on both sides.
For those familiar with SQL, the actual search query resembles the following:
SELECT * FROM records WHERE description LIKE '%searchterm%' AND description NOT LIKE '%negative%'
In this instance, the searchterm refers to one of the three search terms (abbreviations and descriptions) we defined for locus CMOS. The % characters allow any characters to be present outside of the term itself. The negative refers to the negative search term. In this case, the null value is being used (N/A), which should not match anything in the record. If a different value was provided (e.g., pseudogene), then this could match some records and exclude them from the SQL results returned.
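To see the query behave as described, it can be run against a small in-memory database using Python's sqlite3 module. This is an illustrative sketch only: the records table and description column follow the query shown above, but the accession column, the REC1 to REC3 identifiers, and the search helper are assumptions made for the example, and the real database stores more metadata.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (accession TEXT, description TEXT)")
conn.executemany(
    "INSERT INTO records VALUES (?, ?)",
    [
        # made-up records, already uppercased and stripped of , ; : ( )
        ("REC1", "ISOLATE LJAMM-CNP 983 OOCYTE MATURATION FACTOR C-MOS GENE PARTIAL CDS"),
        ("REC2", "ISOLATE LJAMM-CNP 983 OOCYTE MATURATION FACTOR C-MOS PSEUDOGENE PARTIAL CDS"),
        ("REC3", "PARTIAL RAG1 GENE FOR RECOMBINATION ACTIVATING PROTEIN 1"),
    ],
)


def search(term, negative):
    """Return accessions whose description contains term but not negative."""
    query = (
        "SELECT accession FROM records "
        "WHERE description LIKE ? AND description NOT LIKE ? "
        "ORDER BY accession"
    )
    rows = conn.execute(query, (f"%{term}%", f"%{negative}%"))
    return [row[0] for row in rows]


print(search("OOCYTE MATURATION FACTOR", "N/A"))         # null value excludes nothing
print(search("OOCYTE MATURATION FACTOR", "PSEUDOGENE"))  # REC2 is excluded
print(search(" C-MOS ", "PSEUDOGENE"))                   # space-padded abbreviation
```

The sketch passes the terms as query parameters rather than pasting them into the SQL string, which avoids quoting problems if a term contains an apostrophe.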
For abbreviations, this will allow matches to:
ANYTHING term ANYTHING
but not:
ANYTHINGtermANYTHING
This is because the abbreviation search terms contain an extra space on either side, and that must also be present in the description for a match to be found. This prevents spurious matches, for example if a word on the description line happens to contain the substring cmos by chance. Because the parentheses are removed from description lines, this allows matches to a gene abbreviation whether or not it was included in parentheses in the original description line. For example, both of these hypothetical records would result in a match for the abbreviation C-MOS (but not CMOS!):
>JX969518.1 Phymaturus dorsimaculatus isolate LJAMM-CNP 983 oocyte maturation factor (c-mos) gene, partial cds
>JX969518.1 Phymaturus dorsimaculatus isolate LJAMM-CNP 983 c-mos oocyte maturation factor gene
In contrast, for the gene descriptions, matches are found for:
ANYTHINGgene descriptionANYTHING
This allows more flexibility. In the following (somewhat unrealistic) scenario, both descriptions would result in a match for the term oocyte maturation factor:
voucher BYU47312 oocyte maturation factor Mos (C-mos) gene
voucher BYU47312oocyte maturation factorMos (C-mos) gene
That is, it doesn't matter what immediately surrounds the search term. Spaces or other characters are both fine, as long as the term is contained somewhere within the larger description line. Given the variation in gene descriptions, this allows more flexibility for the search terms used. To illustrate this point, let's look at one more example for the gene AKAP9. Here are a few of the records:
>KJ363550.1 Phrynocephalus interscapularis sogdianus A-kinase anchor protein 9 (AKAP9) gene, partial cds
>KU765336.1 Sceloporus adleri voucher UWBM6608 A kinase anchor protein 9 (akap9) gene, partial cds
Notice that some records contain:
A-kinase anchor protein 9
And others contain:
A kinase anchor protein 9
In the above example, the search term kinase anchor protein 9 would be matched in both cases, so including additional terms like A-kinase anchor protein 9 and A kinase anchor protein 9 is unnecessary.
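The difference between the two kinds of matching can be mimicked with plain Python string tests. This is a sketch of the matching rules described above rather than the module's implementation; both function names are hypothetical, and the two descriptions are the AKAP9 records above after the transformation into uppercase, punctuation-free text.

```python
def matches_abbreviation(abbreviation, description):
    """True if the abbreviation occurs with a space on both sides."""
    return f" {abbreviation.upper()} " in description.upper()


def matches_description(term, description):
    """True if the term occurs anywhere in the description."""
    return term.upper() in description.upper()


records = [
    "SOGDIANUS A-KINASE ANCHOR PROTEIN 9 AKAP9 GENE PARTIAL CDS",
    "VOUCHER UWBM6608 A KINASE ANCHOR PROTEIN 9 AKAP9 GENE PARTIAL CDS",
]

for description in records:
    print(
        matches_description("kinase anchor protein 9", description),  # shared substring
        matches_abbreviation("AKAP9", description),                   # whole-word match
        matches_abbreviation("AKAP", description),                    # partial word: no match
    )
```

A single well-chosen description term covers both spellings, while the space padding keeps the abbreviation from matching inside longer words.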
Negative search terms are a new feature starting in SuperCRUNCH v1.3.0. During record searches, if the negative search term is found in a record description, that record will be excluded regardless of whether it matches the gene abbreviation and other description terms.
Using the example search terms file below, I will quickly demonstrate how this works:
Terms:
CMOS    CMOS;C-MOS    oocyte maturation factor    pseudogene
Here, CMOS contains two abbreviations, one description search term, and one negative search term.
Below is a made-up example of two records, one which we want (it is the gene), and one we don't want (it is labeled as a pseudogene, and likely paralogous).
>JX969518.1 Phymaturus dorsimaculatus isolate LJAMM-CNP 983 oocyte maturation factor (c-mos) gene, partial cds
>JX96951XXX Phymaturus dorsimaculatus isolate LJAMM-CNP 983 oocyte maturation factor (c-mos) pseudogene, partial cds
Here both records will match the C-MOS abbreviation and the oocyte maturation factor description. However, fake record JX96951XXX contains the negative search term pseudogene. As a result, it will be discarded. The result is that only JX969518.1 will be selected from this set of two records.
Given the powerful nature of the negative search term, it is important to consider what is appropriate to include. So far, I have only used the negative search term of pseudogene, which has been helpful. By contrast, if you use a widespread term such as gene, you will probably exclude every single record.
If no negative search term should be used, the null value of N/A should be used in the fourth column.
Given the variation in gene abbreviations and descriptions, it is best to use a combination of search terms to obtain the best results. I recommend searching for the gene on GenBank and inspecting several records to identify the best set of search terms, and start there. When the Parse_Loci.py module is run, the number of record matches for every search term is displayed on-screen. For example:
Searching for CMOS:
There are 3 total search terms to use.
No negative search terms included for sequence descriptions.
683 records found using term: ' CMOS '
554 records found using term: ' C-MOS '
1,002 records found using term: 'OOCYTE MATURATION FACTOR'
1,304 total unique records found for CMOS.
Writing 1,304 records to CMOS.fasta
Elapsed time: 0:00:00.347526 (H:M:S).
This way, you'll know exactly how many records were found for each search term. In this case, 683, 554, and 1,002 records were found using each of the three search terms for CMOS (' CMOS ', ' C-MOS ', and 'OOCYTE MATURATION FACTOR'). When this record pool was combined and redundant matches removed, it resulted in a total of 1,304 unique records matched.
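The unique total is smaller than the sum of the per-term counts because many records match more than one term. A set union captures this; the sketch below uses tiny made-up accession sets (A1 to A5 are placeholders, not real records) and is not taken from the module itself.

```python
# Hypothetical matches for each search term, keyed by the term as displayed.
matches_by_term = {
    "' CMOS '": {"A1", "A2", "A3"},
    "' C-MOS '": {"A3", "A4"},
    "'OOCYTE MATURATION FACTOR'": {"A1", "A4", "A5"},
}

for term, accessions in matches_by_term.items():
    print(f"{len(accessions):,} records found using term: {term}")

# Combining the pools as sets removes records matched by several terms.
unique = set().union(*matches_by_term.values())
print(f"{len(unique):,} total unique records found")
```

Here the per-term counts add up to 8, but only 5 distinct records are present, which is the same relationship as 683 + 554 + 1,002 versus 1,304 in the CMOS output.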
For a more complex search, in which 16S is being targeted as well as whole mitogenomes, this is particularly helpful:
Searching for MTDNA_16S:
There are 9 total search terms to use.
Excluding records with these terms in sequence description: PSEUDOGENE
45,288 records found using term: ' 16S '
43,875 records found using term: '16S RIBOSOMAL RNA'
7,241 records found using term: 'LARGE SUBUNIT RIBOSOMAL'
1,886 records found using term: '16S RRNA'
360 records found using term: 'MITOCHONDRION COMPLETE GENOME'
31 records found using term: 'MITOCHONDRIAL DNA COMPLETE GENOME'
217 records found using term: 'MITOCHONDRION PARTIAL GENOME'
2 records found using term: 'PARTIAL MITOCHONDRIAL GENOME'
1 records found using term: 'COMPLETE MITOCHONDRIAL GENOME'
54,001 total unique records found for MTDNA_16S.
Writing 54,001 records to MTDNA_16S.fasta
Elapsed time: 0:00:04.728652 (H:M:S).
The Parse_Loci.py module is trivial to run multiple times after the SQL database is constructed from the records. If you think some records are not being recovered, you can add search terms to a locus and re-run the search. If the total number of unique records increases, the new terms recovered additional records; if it stays the same, they did not. If you choose to re-run this step multiple times, just make sure you specify a different output directory each time!
SuperCRUNCH has the ability to slice out sequences from larger records, including whole mitochondrial genomes. If your starting sequence set contains complex mtDNA records (long stretches of mtDNA composed of multiple loci) or whole mitogenomes, you'll want to target these in your searches for any mitochondrial gene. Let's say we want to find all ND2 sequences.
We can include the following abbreviation:
ND2
And we can include the following locus description terms:
dehydrogenase subunit 2
mitochondrion complete genome
mitochondrial DNA complete genome
mitochondrion partial genome
And it would be a good idea to include the following negative search term:
pseudogene
So the search terms file would look like this:
ND2    ND2    dehydrogenase subunit 2;mitochondrion complete genome;mitochondrial DNA complete genome;mitochondrion partial genome    pseudogene
This will include ND2-specific records as well as larger mtDNA fragments that may contain partial or full ND2 sequence. It will also exclude any ND2 pseudogenes from being selected. We can easily extract the ND2 region from all these record types (ND2 records and whole mitogenomes) in a downstream step (see Similarity Searches and Filtering).
The strategy for obtaining sets of UCE sequences is a little different from the smaller locus sets. To generate a locus search terms file, I retrieved the UCE names from the uce-5k-probes.fasta file located here. Unfortunately, there does not appear to be a standard naming convention for the UCE loci on GenBank. If the sequences have been properly curated, then the description lines should contain the UCE name somewhere (uce-10, uce-453, uce-5810, etc). If so, they will be compatible with the 5k UCE locus search terms file (Locus-Search-Terms_UCE_5k_set.txt) I've made available in the data folder here.
Here are partial contents from a UCE locus search term file:
uce-5805 uce-5805 xxxxxxxxxxxx N/A
uce-5806 uce-5806 xxxxxxxxxxxx N/A
uce-5808 uce-5808 xxxxxxxxxxxx N/A
uce-5810 uce-5810 xxxxxxxxxxxx N/A
Notice that the third column always contains a junk term. Unfortunately, UCE loci have been numbered in a suboptimal way. For example, the label uce-1 is used instead of uce-0001. This causes problems when searching for locus descriptions, because the term uce-1 is a substring that will be found inside many other UCE terms (uce-10, uce-104, uce-1638, etc). Because of this, the locus description search will not work properly, and we have to rely exclusively on the abbreviation to find the correct sequences. A negative search term is not used here either, so the fourth column is filled with N/A.
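The substring problem, and the reason for the junk description term, can be shown in a few lines. This is a sketch under stated assumptions: the description is a made-up, already-transformed line modeled on the UCE records in this section, and uce_search_line is a hypothetical helper that builds one line in the four-column format shown above.

```python
def uce_search_line(number, junk="xxxxxxxxxxxx"):
    """Build one tab-delimited search terms line for a UCE locus."""
    name = f"uce-{number}"
    # label, abbreviation, junk description (never expected to match), null negative term
    return "\t".join([name, name, junk, "N/A"])


# Hypothetical transformed description line for a uce-10 record.
description = "VOUCHER RMB1887 ULTRA CONSERVED ELEMENT LOCUS UCE-10 GENOMIC SEQUENCE"

print("UCE-1" in description)     # description-style match: wrongly finds uce-1
print(" UCE-1 " in description)   # abbreviation-style match: correctly rejects uce-1
print(" UCE-10 " in description)  # abbreviation-style match: finds uce-10
print(uce_search_line(5806))
```

Because only the space-padded abbreviation distinguishes uce-1 from uce-10, the description column is filled with a term that should never occur in a real record.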
Here are some properly labeled GenBank records for UCE sequences:
>KY160876.1 Kaloula kalingensis voucher RMB1887 ultra conserved element locus uce-5806 genomic sequence
ATATTTGTGTTTATTTTCTACTTGTATTAATTGACAACATTTGCCTGTTGGCTCAAGGGAATCAGTGTTC
CCATTTTATGCACTCTATTTTAAAATGCAGACAGTGGTAGAACAGATGTGTTTTTTTTAACCCCATA...
>KY160875.1 Kaloula pulchra voucher KU328278 ultra conserved element locus uce-5806 genomic sequence
ATATTTGTGTTTATTTTCTACTTGTATTAATTGACAACATTTGCCTGTTGGCTTAAGGGAATCATTGTTG
CCATTTTATGCACTCTATTTTAAAATGCATACAGTGGTAGAACAGATGTGTTTTTTTAACCCCATAG...
>KY160874.1 Kaloula picta voucher KU321376 ultra conserved element locus uce-5806 genomic sequence
ATATTTGTGTTTATTTTCTACTTGTATTAATTGACAACATTTGCCTGTTGGCTCAAGGGAATCAGTGTTG
CCATTTTATGCACTCTATTTTAAAATGCAGACAGTGGTAGAACAGATGTGTTTTTTTTAACCCCATA...
>KY160873.1 Kaloula conjuncta negrosensis voucher KU328639 ultra conserved element locus uce-5806 genomic sequence
ATATTTGTGTTTATTTTCTACTTGTATTAATTGACAACATTTGCCTGTTGGCTCAAGGGAATCAGTGTTG
CCATTTTATGCACTCTATTTTAAAATGCAGACAGTGGTAGAACAGATGTGTTTTTTTTAACCCCATA...
The locus abbreviation search terms successfully retrieved the corresponding loci in this sequence set. If the sequence records you are working with contain a UCE label that follows the uce-number pattern, then the 5K UCE search terms file should also work for your dataset.
To implement the locus search terms to find sequences and write the gene-specific fasta files, the Parse_Loci.py module can be used. The details of this module are provided below.
This module can be used to find sequences in a fasta file based on a list of taxon names and a locus search terms file, and to write those sequences to locus-specific fasta files. For a sequence to be written to a locus-specific fasta file, it must match either the gene abbreviation or description for that locus AND have a taxon label that is present in the taxon names list. The taxon names list can contain a mix of species (two-part) and subspecies (three-part) names. The --no_subspecies flag can be used to only include species names in searches. In this case, only species names will be considered for records, regardless of whether or not they have a valid subspecies name. Details on the taxon list, taxon searches, and subspecies options are provided in the Assessing Taxonomy section. In particular, the "Taxonomy searches with and without subspecies" section is helpful.
All searches occur using SQL (via sqlite3), and an SQL database is constructed from the input file (-i) in the output directory specified (-o). If the database for the input file has already been made, the full path to it can be specified using the --sql_db flag, which will save time for multiple runs on very large fasta files. All output fasta files and a summary file are written to their relevant directories within the output directory specified (-o). The records written to the locus-specific fasta files will contain modified labels, with details described below.
If particular sequences should be excluded from the searches, a path to a text file with a list of relevant accession numbers (one per line) for these sequences can be specified using the --exclude flag.
python Parse_Loci.py -i <fasta file> -o <output directory> -l <locus term file> -t <taxon file>
-i <fasta file> - Required: The full path to a fasta file of sequence data.
-o <output directory> - Required: The full path to an existing directory to write output files.
-l <locus term file> - Required: The full path to a text file containing loci information to search for within the fasta file.
-t <taxon file> - Required: The full path to a text file containing all taxon names to cross-reference in the fasta file.
--no_subspecies - Optional: Ignore subspecies labels in searches and only write species names in the updated description lines for sequences.
--exclude - Optional: The full path to a text file containing a list of accession numbers to ignore during searches.
--sql_db - Optional: The full path to the sql database to use for searches. Assumes the database was created with this module for the input fasta file being used.
python Parse_Loci.py -i bin/Loci/Merged.fasta -l bin/Loci/locus_search_terms.txt -t bin/Loci/Taxa_List.txt -o bin/Loci/Output/
The above command will use the locus search terms file locus_search_terms.txt and the taxon names file Taxa_List.txt to parse records in Merged.fasta, writing outputs to the specified directory.
python Parse_Loci.py -i bin/Loci/Merged.fasta -l bin/Loci/locus_search_terms.txt -t bin/Loci/Taxa_List.txt -o bin/Loci/Output/ --no_subspecies
The above command will use the locus search terms file locus_search_terms.txt and the taxon names file Taxa_List.txt to parse records in Merged.fasta, ignoring the subspecies component of taxon names. Outputs are written to the specified directory.
Several outputs are created in the output directory specified.
- Directory /Parsed-Fasta-Files - This folder contains fasta files for all loci included in the locus search terms file, for which at least one sequence was found.
- File [NAME].sql.db - The SQL database produced from the input fasta file NAME.fasta.
- Directory /Summary-File - This folder contains a single file called Loci_Record_Counts.txt. This is a log file summarizing the number of records written per locus. An example of the contents of this file is shown below:
Locus_Name Records_Written
BDNF 1246
CMOS 1263
CXCR4 164
EXPH5 650
KIAA1549 0
In the example above, the files BDNF.fasta, CMOS.fasta, CXCR4.fasta, and EXPH5.fasta will be written to the output directory, but not KIAA1549.fasta because no sequences were found.
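After a run, the summary file is a convenient place to spot loci whose search terms may need attention. The sketch below lists loci with zero records written. It assumes the two columns are separated by whitespace, as in the example contents above, and loci_without_records is a hypothetical helper that is not part of SuperCRUNCH.

```python
def loci_without_records(summary_text):
    """Return locus names whose Records_Written count is zero."""
    empty = []
    lines = summary_text.strip().splitlines()
    for line in lines[1:]:  # the first line is the header
        parts = line.split()
        if len(parts) != 2:
            continue  # ignore anything that is not a name/count pair
        locus, count = parts
        if int(count) == 0:
            empty.append(locus)
    return empty


summary = """Locus_Name Records_Written
BDNF 1246
CMOS 1263
CXCR4 164
EXPH5 650
KIAA1549 0
"""

print(loci_without_records(summary))
```

For a real run, the text would be read from Loci_Record_Counts.txt in the Summary-File directory instead of the inline string used here.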
There are two main items that can greatly affect the outcome of this step:
- the taxon list
- the locus search terms file
For any sequence to be written to a fasta file, it must match a taxon name present in the taxon list. It is important to understand if the user-supplied taxonomy is compatible with the sequence records. For details on this, please see the Assessing Taxonomy section. For information about obtaining and using locus search terms, please see the above sections on this page. Poor matches to the taxonomy or the locus search terms can result in suboptimal results, with many potentially useful sequences not being found during searches.
When a sequence matching a taxon name and locus search term is found, it will be written to the locus-specific fasta file. However, the original record label will be modified. The accession number and appropriate taxon label (depending on the --no_subspecies flag) will be included, but the remaining description line is denoted by a DESCRIPTION phrase and may be prefaced with a newly created Voucher_ flag. The voucher tag is created automatically and can be used in downstream steps to create 'vouchered' datasets. The voucher tag can only be created if the sequence record contains a voucher, isolate, or strain field. The following example shows how the records are modified, both in terms of the description and the voucher tags.
Initial records:
>DQ237481.1 Liolaemus kingii voucher LJAMM3040 oocyte maturation factor Mos (C-mos) gene, partial cds
TGCAGTAAGAATAGTTTGGCATCACGGCAGAGTTTCTGGGCAGAATTAAATGTGGCACATCTTGACCATAACAATGTGGTACGTGTT...
>DQ340659.1 Lophognathus gilberti isolate ABTC12010 oocyte maturation factor Mos (c-mos) gene, partial cds
CTCCGTCTGGAATTTTCTCCATCTGTTAATGCACGACCCTGCAGTAGCCCCCTGGTATTTCCAACCAAAGGGGAAAAGTTATTTTCT...
>HF570650.1 Calumma guillaumeti partial cmos gene for oocyte maturation factor, specimen voucher ZSM:440/2005
CCTGCAGCAGTCCCCTGGTGTTTCCAGTCAAAGGTGGGAAGTTATTTTGTGAGGAAGGTCCTTCTCCTAGGGCTTGCCGGCTACCCC...
>HM352537.1 Sauromalus ater oocyte maturation factor Mos (C-mos) gene, partial cds
TGCAGTAAGAACAGTCTGGCATCACGGCAGAGTTTCTGGGCAGAATTAAACGTGGCACATCTTGAACATAAAAATGTGGTACGTGTA...
Records written to the locus-specific fasta file:
>DQ237481.1 Liolaemus kingii Voucher_LJAMM3040 DESCRIPTION voucher ljamm3040 oocyte maturation factor mos c-mos gene partial cds
TGCAGTAAGAATAGTTTGGCATCACGGCAGAGTTTCTGGGCAGAATTAAATGTGGCACATCTTGACCATAACAATGTGGTACGTGTT...
>DQ340659.1 Lophognathus gilberti Voucher_ABTC12010 DESCRIPTION isolate abtc12010 oocyte maturation factor mos c-mos gene partial cds
CTCCGTCTGGAATTTTCTCCATCTGTTAATGCACGACCCTGCAGTAGCCCCCTGGTATTTCCAACCAAAGGGGAAAAGTTATTTTCT...
>HF570650.1 Calumma guillaumeti Voucher_ZSM440/2005 DESCRIPTION partial cmos gene for oocyte maturation factor specimen voucher zsm440/2005
CCTGCAGCAGTCCCCTGGTGTTTCCAGTCAAAGGTGGGAAGTTATTTTGTGAGGAAGGTCCTTCTCCTAGGGCTTGCCGGCTACCCC...
>HM352537.1 Sauromalus ater DESCRIPTION oocyte maturation factor mos c-mos gene partial cds
TGCAGTAAGAACAGTCTGGCATCACGGCAGAGTTTCTGGGCAGAATTAAACGTGGCACATCTTGAACATAAAAATGTGGTACGTGTA...
Notice that the fourth record did not receive a Voucher_ flag because it lacked voucher, isolate, and strain fields in the original description.
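If you need to read the modified labels in your own scripts, the layout shown above (accession, taxon name, optional Voucher_ tag, then DESCRIPTION and the remaining text) can be split apart as follows. This is a sketch based only on the example records in this section; parse_label is a hypothetical helper, not a SuperCRUNCH function.

```python
def parse_label(label):
    """Split a modified record label into its components."""
    header, _, description = label.lstrip(">").partition(" DESCRIPTION ")
    tokens = header.split()
    accession = tokens[0]
    voucher = None
    # The voucher tag, when present, is the last token before DESCRIPTION.
    if tokens[-1].startswith("Voucher_"):
        voucher = tokens.pop()[len("Voucher_"):]
    taxon = " ".join(tokens[1:])  # two-part or three-part name
    return {
        "accession": accession,
        "taxon": taxon,
        "voucher": voucher,
        "description": description,
    }


vouchered = parse_label(
    ">DQ237481.1 Liolaemus kingii Voucher_LJAMM3040 DESCRIPTION "
    "voucher ljamm3040 oocyte maturation factor mos c-mos gene partial cds"
)
unvouchered = parse_label(
    ">HM352537.1 Sauromalus ater DESCRIPTION "
    "oocyte maturation factor mos c-mos gene partial cds"
)
print(vouchered["accession"], vouchered["taxon"], vouchered["voucher"])
print(unvouchered["accession"], unvouchered["taxon"], unvouchered["voucher"])
```

Records without a voucher tag simply return None for that field, which makes it easy to separate vouchered from unvouchered sequences.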
It is not necessary to understand the exact details of how the record labels are modified, but it is important to understand they will no longer be identical to the original labels. These modifications allow SuperCRUNCH to quickly find information in downstream steps.
Last updated: January, 2021