
4: Parsing Loci

Daniel Portik edited this page Feb 2, 2021 · 25 revisions



Overview

SuperCRUNCH can be used to find sequences based on a list of taxon names and a list of locus search terms, and build individual fasta files for each locus included. These fasta files are the targets for all downstream steps, including similarity searches, sequence selection, and alignment. The taxon list is covered in great detail in the Assessing Taxonomy section, including how searches work with and without subspecies, whereas the locus search terms are covered on this page. Recommendations for best practices to find loci are provided here, and complete instructions for parsing loci using the Parse_Loci.py module can also be found here.

New in v1.3.0: A negative search term can now be included in the search terms file, in addition to the typical abbreviation and description terms.


Obtaining Locus Search Terms

SuperCRUNCH requires a list of loci and associated search terms to initially identify the contents of sequence records. For each locus included in the list, SuperCRUNCH will search for associated gene abbreviation and description in the sequence record. The choice of loci to include will inherently be group-specific, and surveys of phylogenetic and phylogeographic papers may help to identify an appropriate marker set. There is no limit to the number of loci, and SuperCRUNCH can also be used to search for large genomic data sets available on the NCBI nucleotide database, such as those obtained through sequence capture experiments (UCEs, anchored enrichment, etc.). For detailed instructions on searching for UCEs, see the section below.

The locus search terms file is tab-delimited. Prior to v1.3.0, it contained three columns. From v1.3.0 onward, a fourth column containing negative search terms should be included. This feature is backwards compatible: a three-column file will still work, and the default null value for the negative search term (N/A) is generated automatically for each locus, but this is not recommended. When creating a new search terms file, use the four-column format.

  • Column One: The first column contains the locus name that will be used to label output files. It must not contain any spaces or special characters (including underscores and dashes), and should be kept short and simple.

  • Column Two: The second column contains the known abbreviation(s) for the gene or marker. Abbreviations should not include any spaces, but can contain numbers and other characters (like dashes). The second column can contain multiple abbreviations, which should be separated by a semi-colon, with no extra spaces between abbreviations.

  • Column Three: The third column contains a longer label of the gene or marker, such as its full name or description. This third column can also contain multiple search entries, which also should be separated by a semi-colon with no extra spaces between label entries. Because of the way searching works, the locus description terms should not be distinguished by any of the following characters: ,, ;, :, (, or ). They can contain them, but these characters will be automatically removed from the descriptions when they are read by the program. This character removal also happens to the sequence description lines.

  • Column Four: The fourth column contains any negative search terms. If a negative term is found in a record description, that record will be excluded, regardless of whether it matches the abbreviation and description terms provided in columns two and three. If no negative term is desired, the null value of N/A should be provided in this column.

The abbreviations and labels are not case-specific, as they are converted to uppercase during actual searches, along with the description lines of the sequences.

The success of finding loci depends on defining appropriate locus abbreviations and labels. Examples of how searches operate can be found below, which should help guide how to select good search terms. For any locus, it is a good idea to search for the locus on GenBank and examine several records to identify the common ways it is labeled.

Here is an example of the formatting for a locus search terms file containing three genes to search for. The columns are as follows (a header should not be included in the actual file).

Label	Abbreviation	Description	Negative

Terms:

CMOS	CMOS;C-MOS	oocyte maturation factor	pseudogene
EXPH5	EXPH5	exophilin;exophilin 5;exophilin-5;exophilin protein 5	N/A
PTPN	PTPN;PTPN12	protein tyrosine phosphatase;tyrosine phosphatase non-receptor type 12	N/A

In this example:

  • CMOS contains two abbreviations, one description search term, and one negative search term.

  • EXPH5 contains one abbreviation, four description search terms, and a null value for the negative search term (no negative term is used).

  • PTPN contains two abbreviations, two description search terms, and a null value for the negative search term (no negative term is used).
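The four-column layout described above can be read with a few lines of Python. This is only a minimal sketch of the file format, not the code used by Parse_Loci.py; the names LocusTerms and read_locus_terms are hypothetical, and the fallback to N/A for three-column files mirrors the behavior described on this page.

```python
import io
from typing import NamedTuple, Optional


class LocusTerms(NamedTuple):
    """One row of the search terms file (hypothetical container)."""
    label: str
    abbreviations: list
    descriptions: list
    negative: Optional[str]  # None when the null value N/A is used


def read_locus_terms(handle):
    """Parse tab-delimited locus search terms (three or four columns, no header)."""
    loci = []
    for line in handle:
        if not line.strip():
            continue  # skip blank lines
        fields = [field.strip() for field in line.rstrip("\n").split("\t")]
        if len(fields) < 3:
            raise ValueError(f"Expected at least three columns: {line!r}")
        label, abbrevs, descs = fields[:3]
        # A missing or empty fourth column falls back to the null value N/A.
        negative = fields[3] if len(fields) > 3 and fields[3] else "N/A"
        loci.append(LocusTerms(
            label=label,
            abbreviations=[a.upper() for a in abbrevs.split(";") if a],
            descriptions=[d.upper() for d in descs.split(";") if d],
            negative=None if negative.upper() == "N/A" else negative.upper(),
        ))
    return loci


example = io.StringIO(
    "CMOS\tCMOS;C-MOS\toocyte maturation factor\tpseudogene\n"
    "EXPH5\tEXPH5\texophilin;exophilin 5;exophilin-5;exophilin protein 5\tN/A\n"
    "PTPN\tPTPN;PTPN12\tprotein tyrosine phosphatase;"
    "tyrosine phosphatase non-receptor type 12\tN/A\n"
)
loci = read_locus_terms(example)
for locus in loci:
    print(locus.label, len(locus.abbreviations), len(locus.descriptions), locus.negative)
```

Running this on the three example loci reports the same counts of abbreviations and description terms listed in the bullets above, which is a quick way to check a new search terms file for stray spaces or missing tabs.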



How do locus searches work?

Searches for loci are conducted using the Parse_Loci.py module. The sequence records are parsed to obtain metadata that is entered into an SQL database. Among other information, the description line (minus accession and taxon name) is entered as one column in the database. Before it is entered, the description line is converted to uppercase and stripped of the following characters: ,, ;, :, (, and ). Here is an example of how the descriptions of four sequence records are transformed before they are entered into the SQL database.

Initial records:

>JX969518.1 Phymaturus dorsimaculatus isolate LJAMM-CNP 983 oocyte maturation factor (c-mos) gene, partial cds

>AF154147.1 Sceloporus jarrovii oberon strain 29a 12S ribosomal RNA gene, partial sequence; mitochondrial gene for mitochondrial product

>AM055650.1 Rieppeleon brachyurus mitochondrial partial 12S rRNA gene, specimen voucher MHNG 2624.082

>HF570740.1 Calumma brevicornis partial RAG1 gene for recombination activating protein 1, specimen voucher ZSM:549/2001

Transformed description components:

ISOLATE LJAMM-CNP 983 OOCYTE MATURATION FACTOR C-MOS GENE PARTIAL CDS 

OBERON STRAIN 29A 12S RIBOSOMAL RNA GENE PARTIAL SEQUENCE MITOCHONDRIAL GENE FOR MITOCHONDRIAL PRODUCT

MITOCHONDRIAL PARTIAL 12S RRNA GENE SPECIMEN VOUCHER MHNG 2624.082 

PARTIAL RAG1 GENE FOR RECOMBINATION ACTIVATING PROTEIN 1 SPECIMEN VOUCHER ZSM549/2001 

Note the ,, ;, :, (, and ) have been removed, and the lines are in uppercase.
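The transformation can be approximated in Python as follows. This is a sketch under the assumption that the accession and taxon name have already been removed from the description; the function name normalize_description is hypothetical and the module may implement this step differently.

```python
# Characters removed from description lines, as listed above.
STRIPPED_CHARACTERS = ",;:()"


def normalize_description(text):
    """Remove the listed punctuation and convert the text to uppercase."""
    table = str.maketrans("", "", STRIPPED_CHARACTERS)
    return text.translate(table).upper()


cmos = normalize_description(
    "isolate LJAMM-CNP 983 oocyte maturation factor (c-mos) gene, partial cds"
)
rag1 = normalize_description(
    "partial RAG1 gene for recombination activating protein 1, "
    "specimen voucher ZSM:549/2001"
)
print(cmos)
print(rag1)
```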

For each locus, a search is conducted for every abbreviation and gene description included. Let's take one gene as an example, and assume this is the contents of the locus file:

Label	Abbreviation	Description	Negative

Terms:

CMOS	CMOS;C-MOS	oocyte maturation factor	N/A

This would result in a search for the following:

Abbreviations: ' CMOS '

' C-MOS '

Description: 'OOCYTE MATURATION FACTOR'

Negative: N/A

Note that the search terms are also converted to uppercase, and that the abbreviations receive an extra space character on both sides.

For those familiar with SQL, the actual search query is like so:

SELECT * FROM records WHERE description LIKE '%searchterm%' AND description NOT LIKE '%negative%'

In this instance, the searchterm refers to one of the three search terms (abbreviations and descriptions) we found for locus CMOS. The % characters allow any characters to be present outside of the term itself. The negative refers to the negative search term. In this case, the null value is being used (N/A), which should not match anything in the record. If a different value was provided (e.g., pseudogene), then this could match some records and exclude them from the SQL results returned.
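The query can be tried out with Python's built-in sqlite3 module and an in-memory database. This is an illustrative sketch only: the table name records and the column description come from the query above, while the accession column, the search helper, and the two-row data set are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed minimal schema; the real database stores additional metadata.
conn.execute("CREATE TABLE records (accession TEXT, description TEXT)")
conn.executemany(
    "INSERT INTO records VALUES (?, ?)",
    [
        ("JX969518.1",
         "ISOLATE LJAMM-CNP 983 OOCYTE MATURATION FACTOR C-MOS GENE PARTIAL CDS "),
        ("AM055650.1",
         "MITOCHONDRIAL PARTIAL 12S RRNA GENE SPECIMEN VOUCHER MHNG 2624.082 "),
    ],
)

QUERY = (
    "SELECT accession FROM records "
    "WHERE description LIKE ? AND description NOT LIKE ?"
)


def search(term, negative="N/A"):
    """Return accessions whose description contains term but not negative."""
    rows = conn.execute(QUERY, (f"%{term}%", f"%{negative}%"))
    return [accession for (accession,) in rows]


print(search(" CMOS "))                    # []
print(search(" C-MOS "))                   # ['JX969518.1']
print(search("OOCYTE MATURATION FACTOR"))  # ['JX969518.1']
```

Passing the terms as bound parameters rather than formatting them into the SQL string avoids quoting problems if a search term happens to contain an apostrophe.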

For abbreviations, this will allow matches to:

ANYTHING term ANYTHING

but not:

ANYTHINGtermANYTHING

This is because the abbreviation search terms contain an extra space on either side, and that must also be present in the description for a match to be found. This prevents spurious matches, for example if a word on the description line happens to contain a substring of cmos by chance. Because the parentheses are removed from description lines, this allows matches to a gene abbreviation whether or not it was included in parentheses in the original description line. For example, both of these hypothetical records would result in a match for the abbreviation C-MOS (but not CMOS!):

>JX969518.1 Phymaturus dorsimaculatus isolate LJAMM-CNP 983 oocyte maturation factor (c-mos) gene, partial cds
>JX969518.1 Phymaturus dorsimaculatus isolate LJAMM-CNP 983 c-mos oocyte maturation factor gene

In contrast, for the gene descriptions, matches are found for:

ANYTHINGgene descriptionANYTHING

This allows more flexibility. In the following (somewhat unrealistic) scenario, both descriptions would result in a match for the term oocyte maturation factor:

voucher BYU47312 oocyte maturation factor Mos (C-mos) gene

voucher BYU47312oocyte maturation factorMos (C-mos) gene

That is, it doesn't matter what immediately surrounds the search term. Spaces or other characters are both fine, as long as the term is contained somewhere within the larger description line. Given the variation in gene descriptions, this allows more flexibility for the search terms used. To illustrate this point, let's look at one more example for the gene AKAP9. Here are a few of the records:

>KJ363550.1 Phrynocephalus interscapularis sogdianus A-kinase anchor protein 9 (AKAP9) gene, partial cds

>KU765336.1 Sceloporus adleri voucher UWBM6608 A kinase anchor protein 9 (akap9) gene, partial cds

Notice that some records contain:

A-kinase anchor protein 9

And others contain:

A kinase anchor protein 9

In the above example, the search term kinase anchor protein 9 would be matched in both cases, so including additional terms like A-kinase anchor protein 9 and A kinase anchor protein 9 is unnecessary.
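One way to compare candidate description terms is to count how many observed description variants each one matches. The snippet below is a small sketch using plain substring tests on the two AKAP9 descriptions above (already uppercased and stripped of punctuation); the variable names are illustrative only.

```python
# Description variants observed for AKAP9, after uppercasing and
# removal of the characters , ; : ( ).
variants = [
    "A-KINASE ANCHOR PROTEIN 9 AKAP9 GENE PARTIAL CDS",
    "VOUCHER UWBM6608 A KINASE ANCHOR PROTEIN 9 AKAP9 GENE PARTIAL CDS",
]

candidates = [
    "A-KINASE ANCHOR PROTEIN 9",
    "A KINASE ANCHOR PROTEIN 9",
    "KINASE ANCHOR PROTEIN 9",
]

# Count how many variants contain each candidate term as a substring.
coverage = {term: sum(term in text for text in variants) for term in candidates}
for term, count in coverage.items():
    print(f"{count} of {len(variants)} variants matched by: {term!r}")
```

Only the shortest term matches both variants here, which is why the longer alternatives add nothing to the search.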



Including a negative search term

This is a new feature starting in SuperCRUNCH v1.3.0. During record searches, if the negative search term is found in a record description, that record will be excluded regardless of whether it matches the gene abbreviation and other description terms.

Using the example search terms file below, I will quickly demonstrate how this works:

Terms:

CMOS	CMOS;C-MOS	oocyte maturation factor	pseudogene

Here, CMOS contains two abbreviations, one description search term, and one negative search term.

Below is a made-up example of two records, one which we want (it is the gene), and one we don't want (it is labeled as a pseudogene, and likely paralogous).

>JX969518.1 Phymaturus dorsimaculatus isolate LJAMM-CNP 983 oocyte maturation factor (c-mos) gene, partial cds

>JX96951XXX Phymaturus dorsimaculatus isolate LJAMM-CNP 983 oocyte maturation factor (c-mos) pseudogene, partial cds

Here both records will match the C-MOS abbreviation and the oocyte maturation factor description. However, fake record JX96951XXX contains the negative search term pseudogene. As a result, it will be discarded. The result is that only JX969518.1 will be selected from this set of two records.
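The same decision can be written as a small Python predicate. This is a sketch of the matching rules described on this page (padded abbreviations, substring descriptions, and a negative term that overrides both), not the module's actual code; record_matches is a hypothetical helper and the descriptions are assumed to be already stripped of punctuation.

```python
def record_matches(description, abbreviations, descriptions, negative=None):
    """Return True if the description matches any term and lacks the negative term."""
    text = description.upper()
    if negative and negative.upper() in text:
        return False  # the negative term overrides any positive match
    padded = [f" {abbrev.upper()} " for abbrev in abbreviations]
    terms = padded + [desc.upper() for desc in descriptions]
    return any(term in text for term in terms)


gene = "ISOLATE LJAMM-CNP 983 OOCYTE MATURATION FACTOR C-MOS GENE PARTIAL CDS"
pseudo = "ISOLATE LJAMM-CNP 983 OOCYTE MATURATION FACTOR C-MOS PSEUDOGENE PARTIAL CDS"
abbrevs = ["CMOS", "C-MOS"]
descs = ["oocyte maturation factor"]

print(record_matches(gene, abbrevs, descs, negative="pseudogene"))    # True
print(record_matches(pseudo, abbrevs, descs, negative="pseudogene"))  # False
# An overly general negative term removes both records.
print(record_matches(gene, abbrevs, descs, negative="gene"))          # False
```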

Given the powerful nature of the negative search term, it is important to consider what is appropriate to include. So far, I have only used the negative search term of pseudogene, which has been helpful. By contrast, if you use a widespread term such as gene, you will probably exclude every single record.

If no negative search term should be used, the null value of N/A should be used in the fourth column.



How do I know how well my search terms are performing?

Given the variation in gene abbreviations and descriptions, it is best to use a combination of search terms to obtain the best results. I recommend searching for the gene on GenBank and inspecting several records to identify the best set of search terms, and start there. When the Parse_Loci.py module is run, the number of record matches for every search term is displayed on-screen. For example:


Searching for CMOS:

	There are 3 total search terms to use.
	No negative search terms included for sequence descriptions.

	683 records found using term: ' CMOS '
	554 records found using term: ' C-MOS '
	1,002 records found using term: 'OOCYTE MATURATION FACTOR'

	1,304 total unique records found for CMOS.
	Writing 1,304 records to CMOS.fasta

	Elapsed time: 0:00:00.347526 (H:M:S).
	

This way, you'll know exactly how many records were found for each search term. In this case, 683, 554, and 1,002 records were found using each of the three search terms for CMOS (' CMOS ', ' C-MOS ', and 'OOCYTE MATURATION FACTOR'). When this record pool was combined and redundant matches removed, it resulted in a total of 1,304 unique records matched.
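The unique total is smaller than the sum of the per-term counts because many records match more than one term. A set union captures the idea; the accession values below are made up for illustration and do not come from a real search.

```python
# Hypothetical accessions matched by each search term.
hits = {
    " CMOS ": {"ACC1", "ACC2", "ACC3"},
    " C-MOS ": {"ACC3", "ACC4"},
    "OOCYTE MATURATION FACTOR": {"ACC2", "ACC3", "ACC4", "ACC5"},
}

per_term_total = sum(len(accessions) for accessions in hits.values())
unique_records = set().union(*hits.values())

print(per_term_total)       # 9
print(len(unique_records))  # 5
```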

For a more complex search, in which 16S is being targeted as well as whole mitogenomes, this is particularly helpful:

Searching for MTDNA_16S:

	There are 9 total search terms to use.
	Excluding records with these terms in sequence description: PSEUDOGENE

	45,288 records found using term: ' 16S '
	43,875 records found using term: '16S RIBOSOMAL RNA'
	7,241 records found using term: 'LARGE SUBUNIT RIBOSOMAL'
	1,886 records found using term: '16S RRNA'
	360 records found using term: 'MITOCHONDRION COMPLETE GENOME'
	31 records found using term: 'MITOCHONDRIAL DNA COMPLETE GENOME'
	217 records found using term: 'MITOCHONDRION PARTIAL GENOME'
	2 records found using term: 'PARTIAL MITOCHONDRIAL GENOME'
	1 records found using term: 'COMPLETE MITOCHONDRIAL GENOME'

	54,001 total unique records found for MTDNA_16S.
	Writing 54,001 records to MTDNA_16S.fasta

	Elapsed time: 0:00:04.728652 (H:M:S).

The Parse_Loci.py module is trivial to run multiple times after the SQL database is constructed from the records. If you think some records are not being recovered, you can add search terms to a locus and re-run the search. If the total number of unique records increases, you'll know the new terms recovered additional records. If you choose to re-run this step multiple times, just make sure you specify a different output directory each time!


What's the best strategy for recovering mitochondrial genes?

SuperCRUNCH has the ability to slice out sequences from larger records, including whole mitochondrial genomes. If your starting sequence set contains complex mtDNA records (long stretches of mtDNA composed of multiple loci) or whole mitogenomes, you'll want to target these in your searches for any mitochondrial gene. Let's say we want to find all ND2 sequences.

We can include the following abbreviation:

ND2

And we can include the following locus description terms:

dehydrogenase subunit 2
mitochondrion complete genome
mitochondrial DNA complete genome
mitochondrion partial genome

And it would be a good idea to include the following negative search term:

pseudogene

So the search terms file would look like this:

ND2	ND2	dehydrogenase subunit 2;mitochondrion complete genome;mitochondrial DNA complete genome;mitochondrion partial genome	pseudogene

This will include ND2-specific records as well as larger mtDNA fragments that may contain a partial or full ND2 sequence. It will also exclude any ND2 pseudogenes from being selected. We can easily extract the ND2 region from all these record types (ND2 records and whole mtDNA genomes) in a downstream step (see Similarity Searches and Filtering).


How do I search for UCE loci?

The strategy for obtaining sets of UCE sequences is a little different from the smaller locus sets. To generate a locus search terms file, I retrieved the UCE names from the uce-5k-probes.fasta file located here. Unfortunately, there does not appear to be a standard naming convention for the UCE loci on GenBank. If the sequences have been properly curated, then the description lines should contain the UCE name somewhere (uce-10, uce-453, uce-5810, etc). If so, they will be compatible with the 5k UCE locus search terms file (Locus-Search-Terms_UCE_5k_set.txt) I've made available in the data folder here.

Here are partial contents from a UCE locus search term file:

uce-5805	uce-5805	xxxxxxxxxxxx	N/A
uce-5806	uce-5806	xxxxxxxxxxxx	N/A
uce-5808	uce-5808	xxxxxxxxxxxx	N/A
uce-5810	uce-5810	xxxxxxxxxxxx	N/A

Notice that the third column always contains a junk term. Unfortunately, UCE loci have been numbered in a suboptimal way. For example, the label uce-1 is used instead of uce-0001. This causes problems when searching for locus descriptions, because the term uce-1 is a substring that will be found inside many other uce terms (uce-10, uce-104, uce-1638, etc). Because of this, the locus description search will not work properly, and we have to rely exclusively on the abbreviation to find the correct sequences. A negative search term is not used here either, so the fourth column is filled with N/A.
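The substring problem is easy to reproduce with plain Python string tests. In this sketch the description line is a made-up example for locus uce-10, padded with spaces so that the abbreviation-style test can be applied at either end.

```python
# Hypothetical, already-normalized description for a uce-10 record.
description = " ULTRA CONSERVED ELEMENT LOCUS UCE-10 GENOMIC SEQUENCE "

# Description-style search (plain substring): uce-1 wrongly matches uce-10.
as_description = "UCE-1" in description

# Abbreviation-style search (space on both sides): uce-1 no longer matches.
as_abbreviation = " UCE-1 " in description

# The correct locus still matches with the padded abbreviation.
correct_locus = " UCE-10 " in description

print(as_description, as_abbreviation, correct_locus)  # True False True
```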

Here are some properly labeled GenBank records for UCE sequences:

>KY160876.1 Kaloula kalingensis voucher RMB1887 ultra conserved element locus uce-5806 genomic sequence
ATATTTGTGTTTATTTTCTACTTGTATTAATTGACAACATTTGCCTGTTGGCTCAAGGGAATCAGTGTTC
CCATTTTATGCACTCTATTTTAAAATGCAGACAGTGGTAGAACAGATGTGTTTTTTTTAACCCCATA...

>KY160875.1 Kaloula pulchra voucher KU328278 ultra conserved element locus uce-5806 genomic sequence
ATATTTGTGTTTATTTTCTACTTGTATTAATTGACAACATTTGCCTGTTGGCTTAAGGGAATCATTGTTG
CCATTTTATGCACTCTATTTTAAAATGCATACAGTGGTAGAACAGATGTGTTTTTTTAACCCCATAG...

>KY160874.1 Kaloula picta voucher KU321376 ultra conserved element locus uce-5806 genomic sequence
ATATTTGTGTTTATTTTCTACTTGTATTAATTGACAACATTTGCCTGTTGGCTCAAGGGAATCAGTGTTG
CCATTTTATGCACTCTATTTTAAAATGCAGACAGTGGTAGAACAGATGTGTTTTTTTTAACCCCATA...

>KY160873.1 Kaloula conjuncta negrosensis voucher KU328639 ultra conserved element locus uce-5806 genomic sequence
ATATTTGTGTTTATTTTCTACTTGTATTAATTGACAACATTTGCCTGTTGGCTCAAGGGAATCAGTGTTG
CCATTTTATGCACTCTATTTTAAAATGCAGACAGTGGTAGAACAGATGTGTTTTTTTTAACCCCATA...

The locus abbreviation search terms successfully retrieved the corresponding loci in this sequence set. If the sequence records you are working with contain a UCE label that follows uce-number, then the 5K UCE search terms file should also work for your dataset.



Building Fasta Files for Loci Available

The Parse_Loci.py module uses the locus search terms to find sequences and write the locus-specific fasta files. The details of this module are provided below.



Parse_Loci.py

This module can be used to find sequences in a fasta file based on a list of taxon names and a locus search terms file, and to write those sequences to locus-specific fasta files. For a sequence to be written to a locus-specific fasta file, it must match either the gene abbreviation or description for that locus AND have a taxon label that is present in the taxon names list. The taxon names list can contain a mix of species (two-part) and subspecies (three-part) names. The --no_subspecies flag can be used to only include species names in searches. In this case, only species names will be considered for records, regardless of whether or not they have a valid subspecies name. Details on the taxon list, taxon searches, and subspecies options are provided in the Assessing Taxonomy section. In particular, the "Taxonomy searches with and without subspecies" section is helpful.

All searches occur using SQL (via sqlite3), and an SQL database is constructed from the input file (-i) in the output directory specified (-o). If the database for the input file has already been made, the full path to it can be specified using the --sql_db flag, which will save time for multiple runs on very large fasta files. All output fasta files and a summary file are written to their relevant directories within the output directory specified (-o). The records written to the locus-specific fasta files will contain modified labels, with details described below.

If particular sequences should be excluded from the searches, a path to a text file with a list of relevant accession numbers (one per line) for these sequences can be specified using the --exclude flag.

Basic Usage:

python Parse_Loci.py -i <fasta file> -o <output directory> -l <locus term file> -t <taxon file> 

Argument Explanations:

-i <path-to-file> or --input <path-to-file>

Required: The full path to a fasta file of sequence data.

-o <path-to-directory> or --outdir <path-to-directory>

Required: The full path to an existing directory to write output files.

-l <path-to-file> or --loci <path-to-file>

Required: The full path to a text file containing loci information to search for within the fasta file.

-t <path-to-file> or --taxa <path-to-file>

Required: The full path to a text file containing all taxon names to cross-reference in the fasta file.

--no_subspecies

Optional: Ignore subspecies labels in searches and only write species names in the updated description lines for sequences.

--exclude <path-to-file>

Optional: The full path to a text file containing a list of accession numbers to ignore during searches.

--sql_db <path-to-sql-database>

Optional: The full path to the sql database to use for searches. Assumes the database was created with this module for the input fasta file being used.

Example Usage:

python Parse_Loci.py -i bin/Loci/Merged.fasta -l bin/Loci/locus_search_terms.txt -t bin/Loci/Taxa_List.txt -o bin/Loci/Output/

The above command will use the locus search terms file locus_search_terms.txt and the taxon names file Taxa_List.txt to parse records in Merged.fasta, writing outputs to the specified directory.

python Parse_Loci.py -i bin/Loci/Merged.fasta -l bin/Loci/locus_search_terms.txt -t bin/Loci/Taxa_List.txt -o bin/Loci/Output/ --no_subspecies

The above command will use the locus search terms file locus_search_terms.txt and the taxon names file Taxa_List.txt to parse records in Merged.fasta, ignoring the subspecies component of taxon names. Outputs are written to the specified directory.
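When the module is re-run several times with revised search terms, it can be convenient to assemble the command from Python. The helper below is a hypothetical sketch that only uses the flags documented above; the function name, the second output directory, and the database path (inferred from the [NAME].sql.db naming described under Outputs) are assumptions for illustration.

```python
import shlex


def build_parse_loci_command(fasta, outdir, loci, taxa,
                             no_subspecies=False, exclude=None, sql_db=None):
    """Assemble a Parse_Loci.py command line from the documented arguments."""
    cmd = ["python", "Parse_Loci.py",
           "-i", fasta, "-o", outdir, "-l", loci, "-t", taxa]
    if no_subspecies:
        cmd.append("--no_subspecies")
    if exclude:
        cmd += ["--exclude", exclude]
    if sql_db:
        cmd += ["--sql_db", sql_db]
    return cmd


# Second run: reuse the database from a first run, write to a new directory.
cmd = build_parse_loci_command(
    "bin/Loci/Merged.fasta",
    "bin/Loci/Output2/",                      # assumed to exist already
    "bin/Loci/locus_search_terms.txt",
    "bin/Loci/Taxa_List.txt",
    sql_db="bin/Loci/Output/Merged.sql.db",   # assumed location of the database
)
print(shlex.join(cmd))
# To execute: import subprocess; subprocess.run(cmd, check=True)
```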

Outputs:

Several outputs are created in the output directory specified.

  • Directory /Parsed-Fasta-Files - This folder contains fasta files for all loci included in the locus search terms file, for which at least one sequence was found.

  • File [NAME].sql.db - The SQL database produced from the input fasta file NAME.fasta.

  • Directory /Summary-File - This folder contains a single file called Loci_Record_Counts.txt. This is a log file summarizing the number of records written per locus. An example of the contents of this file is shown below:

Locus_Name	Records_Written
BDNF	1246
CMOS	1263
CXCR4	164
EXPH5	650
KIAA1549	0

In the example above, the files BDNF.fasta, CMOS.fasta, CXCR4.fasta, and EXPH5.fasta will be written to the output directory, but not KIAA1549.fasta because no sequences were found.
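Loci with zero records can be picked out of the summary file programmatically. The sketch below assumes Loci_Record_Counts.txt is tab-delimited with a single header line and plain integer counts, as in the example shown; loci_without_records is a hypothetical helper.

```python
import io


def loci_without_records(handle):
    """Return locus names whose Records_Written count is zero."""
    next(handle)  # skip the header line
    empty = []
    for line in handle:
        if not line.strip():
            continue
        name, count = line.rstrip("\n").split("\t")
        if int(count) == 0:
            empty.append(name)
    return empty


summary = io.StringIO(
    "Locus_Name\tRecords_Written\n"
    "BDNF\t1246\n"
    "CMOS\t1263\n"
    "CXCR4\t164\n"
    "EXPH5\t650\n"
    "KIAA1549\t0\n"
)
missing = loci_without_records(summary)
print(missing)  # ['KIAA1549']
```

A locus that appears in this list is a good candidate for revisiting its search terms before concluding that no sequences exist.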

Assessing the performance of locus parsing

There are two main items that can greatly affect the outcome of this step:

  • the taxon list
  • the locus search terms file

For any sequence to be written to a fasta file, it must match a taxon name present in the taxon list. It is important to understand if the user-supplied taxonomy is compatible with the sequence records. For details on this, please see the Assessing Taxonomy section. For information about obtaining and using locus search terms, please see the above sections on this page. Poor matches to the taxonomy or the locus search terms can result in suboptimal results, with many potentially useful sequences not being found during searches.

Modifications to record labels

When a sequence matching a taxon name and locus search term is found, it will be written to the locus-specific fasta file. However, the original record label will be modified. The accession number and appropriate taxon label (depending on the --no_subspecies flag) will be included, but the remaining description line is denoted by a DESCRIPTION phrase and may be prefaced with a newly created Voucher_ flag. The voucher tag is created automatically and can be used in downstream steps to create 'vouchered' datasets. The voucher tag can only be created if the sequence record contains a voucher, isolate, or strain field. The following example shows how the records are modified, both in terms of the description and the voucher tags.

Initial records:

>DQ237481.1 Liolaemus kingii voucher LJAMM3040 oocyte maturation factor Mos (C-mos) gene, partial cds 
TGCAGTAAGAATAGTTTGGCATCACGGCAGAGTTTCTGGGCAGAATTAAATGTGGCACATCTTGACCATAACAATGTGGTACGTGTT...

>DQ340659.1 Lophognathus gilberti isolate ABTC12010 oocyte maturation factor Mos (c-mos) gene, partial cds
CTCCGTCTGGAATTTTCTCCATCTGTTAATGCACGACCCTGCAGTAGCCCCCTGGTATTTCCAACCAAAGGGGAAAAGTTATTTTCT...

>HF570650.1 Calumma guillaumeti partial cmos gene for oocyte maturation factor, specimen voucher ZSM:440/2005 
CCTGCAGCAGTCCCCTGGTGTTTCCAGTCAAAGGTGGGAAGTTATTTTGTGAGGAAGGTCCTTCTCCTAGGGCTTGCCGGCTACCCC...

>HM352537.1 Sauromalus ater oocyte maturation factor Mos (C-mos) gene, partial cds  
TGCAGTAAGAACAGTCTGGCATCACGGCAGAGTTTCTGGGCAGAATTAAACGTGGCACATCTTGAACATAAAAATGTGGTACGTGTA...

Records written to the locus-specific fasta file:

>DQ237481.1 Liolaemus kingii Voucher_LJAMM3040 DESCRIPTION voucher ljamm3040 oocyte maturation factor mos c-mos gene partial cds 
TGCAGTAAGAATAGTTTGGCATCACGGCAGAGTTTCTGGGCAGAATTAAATGTGGCACATCTTGACCATAACAATGTGGTACGTGTT...

>DQ340659.1 Lophognathus gilberti Voucher_ABTC12010 DESCRIPTION isolate abtc12010 oocyte maturation factor mos c-mos gene partial cds 
CTCCGTCTGGAATTTTCTCCATCTGTTAATGCACGACCCTGCAGTAGCCCCCTGGTATTTCCAACCAAAGGGGAAAAGTTATTTTCT...

>HF570650.1 Calumma guillaumeti Voucher_ZSM440/2005 DESCRIPTION partial cmos gene for oocyte maturation factor specimen voucher zsm440/2005 
CCTGCAGCAGTCCCCTGGTGTTTCCAGTCAAAGGTGGGAAGTTATTTTGTGAGGAAGGTCCTTCTCCTAGGGCTTGCCGGCTACCCC...

>HM352537.1 Sauromalus ater DESCRIPTION oocyte maturation factor mos c-mos gene partial cds  
TGCAGTAAGAACAGTCTGGCATCACGGCAGAGTTTCTGGGCAGAATTAAACGTGGCACATCTTGAACATAAAAATGTGGTACGTGTA...

Notice that the fourth record did not receive a Voucher_ flag because it lacked voucher, isolate, and strain fields in the original description.

It is not necessary to understand the exact details of how the record labels are modified, but it is important to understand they will no longer be identical to the original labels. These modifications allow SuperCRUNCH to quickly find information in downstream steps.
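For readers who do want to inspect the modified labels, they can be split back into their parts with a few string operations. This is a sketch based only on the label layout shown in the examples above (accession, taxon name, optional Voucher_ tag, then the DESCRIPTION phrase); parse_label is a hypothetical helper, not part of SuperCRUNCH.

```python
def parse_label(header):
    """Split a modified record label into accession, taxon, voucher, description."""
    left, _, description = header.lstrip(">").partition(" DESCRIPTION ")
    tokens = left.split()
    accession = tokens[0]
    voucher = None
    if tokens[-1].startswith("Voucher_"):
        voucher = tokens.pop()[len("Voucher_"):]
    taxon = " ".join(tokens[1:])  # two-part species or three-part subspecies
    return {
        "accession": accession,
        "taxon": taxon,
        "voucher": voucher,
        "description": description.strip(),
    }


with_voucher = parse_label(
    ">DQ237481.1 Liolaemus kingii Voucher_LJAMM3040 DESCRIPTION "
    "voucher ljamm3040 oocyte maturation factor mos c-mos gene partial cds "
)
without_voucher = parse_label(
    ">HM352537.1 Sauromalus ater DESCRIPTION "
    "oocyte maturation factor mos c-mos gene partial cds  "
)
print(with_voucher["taxon"], with_voucher["voucher"])
print(without_voucher["taxon"], without_voucher["voucher"])
```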



Last updated: January, 2021