4. Extracting taxonomic information from MitoFish and NCBI
shenjean edited this page Mar 16, 2022 · 16 revisions
- Use the R taxonomizr package to map accession numbers to NCBI taxonomy IDs (taxIDs), and then taxIDs to taxonomic names. The R code below is not included in the repository; save it as `taxonomizr.R`.
Note: Records that the software is unable to map are marked as "NA" in the output. Every classification marked "NA" is manually reviewed and corrected where necessary. Some fish species, however, genuinely have no assigned class, order, family, genus, or species; in those cases the "NA" annotation is kept.
library(taxonomizr)
# Prepare the taxonomy database. This will only have to be done once.
# Note it requires a lot of hard drive space, bandwidth and time to process all the data from NCBI
prepareDatabase('accessionTaxa.sql')
input=read.delim("complete.partial.accession.sorted",header=F)
acc=as.vector(input[,1])
# First, get list of taxa IDs
taxIDs=accessionToTaxa(acc,'accessionTaxa.sql',version='base')
# Then, get taxonomic names
taxa=getTaxonomy(taxIDs,'accessionTaxa.sql')
# Write output to table
write.table(taxa,'complete.partial.taxa.tsv',sep="\t")
- Run the R code:
Rscript taxonomizr.R
- Contents of `complete.partial.taxa.tsv`:
"superkingdom" "phylum" "class" "order" "family" "genus" "species"
" 8255" "Eukaryota" "Chordata" "Actinopteri" "Pleuronectiformes" "Paralichthyidae" "Paralichthys" "Paralichthys olivaceus"
" 8255" "Eukaryota" "Chordata" "Actinopteri" "Pleuronectiformes" "Paralichthyidae" "Paralichthys" "Paralichthys olivaceus"
" 8255" "Eukaryota" "Chordata" "Actinopteri" "Pleuronectiformes" "Paralichthyidae" "Paralichthys" "Paralichthys olivaceus"
- Remove quotes and fix the header of `complete.partial.taxa.tsv`. The first column, which contains NCBI taxonomy IDs, will be named `taxid`:
cat complete.partial.taxa.tsv | grep -v superkingdom | tr -d "\"" >complete.partial.noheader.taxtable
# Create new header
echo -e "taxid\tSuperkingdom\tPhylum\tClass\tOrder\tFamily\tGenus\tSpecies" >tax.header
cat tax.header complete.partial.noheader.taxtable >complete.partial.taxtable
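As an optional sanity check on the step above, the following sketch trims any padding spaces left around the taxid value (the quoted R output shows values such as " 8255") and reports rows that do not have exactly eight tab-separated fields. The sample file and the `check_taxtable` helper are hypothetical and are not part of the repository.

```shell
# Hypothetical one-row sample standing in for complete.partial.noheader.taxtable
tmpdir=$(mktemp -d)
printf ' 8255\tEukaryota\tChordata\tActinopteri\tPleuronectiformes\tParalichthyidae\tParalichthys\tParalichthys olivaceus\n' \
  > "$tmpdir/sample.noheader.taxtable"

# check_taxtable (hypothetical helper): strip spaces around the taxid field,
# warn on stderr about rows without 8 fields, and exit non-zero if any were found.
check_taxtable() {
  awk -F '\t' -v OFS='\t' '
    { gsub(/^ +| +$/, "", $1) }
    NF != 8 { printf "line %d has %d fields\n", NR, NF > "/dev/stderr"; bad = 1 }
    { print }
    END { exit bad + 0 }
  ' "$1"
}

check_taxtable "$tmpdir/sample.noheader.taxtable" > "$tmpdir/clean.noheader.taxtable"
```

The cleaned file could then be concatenated with `tax.header` in place of the untrimmed one.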
- Contents of `complete.partial.taxtable`:
taxid Superkingdom Phylum Class Order Family Genus Species
135755 Eukaryota Chordata Actinopteri Centrarchiformes Percichthyidae Gadopsis Gadopsis marmoratus
1581706 Eukaryota Chordata Actinopteri Gobiiformes Gobiidae Periophthalmus Periophthalmus minutus
36177 Eukaryota Chordata Actinopteri Acipenseriformes Acipenseridae Acipenser Acipenser oxyrinchus
8240 Eukaryota Chordata Actinopteri Scombriformes Scombridae Thunnus Thunnus maccoyii
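To support the manual review of "NA" classifications mentioned in the note above, a minimal sketch such as the following could list the rows that need attention. The sample rows are made up for illustration (the second one uses a placeholder taxid and genus), and `list_na_rows` is a hypothetical helper.

```shell
# Hypothetical sample table: header, one complete row, one row with unassigned ranks
tmpdir=$(mktemp -d)
{
  printf 'taxid\tSuperkingdom\tPhylum\tClass\tOrder\tFamily\tGenus\tSpecies\n'
  printf '8240\tEukaryota\tChordata\tActinopteri\tScombriformes\tScombridae\tThunnus\tThunnus maccoyii\n'
  printf '999999999\tEukaryota\tChordata\tActinopteri\tNA\tNA\tExamplegenus\tExamplegenus sp.\n'
} > "$tmpdir/sample.taxtable"

# list_na_rows (hypothetical helper): print every data row in which
# at least one rank column (fields 2 to 8) is exactly "NA".
list_na_rows() {
  awk -F '\t' 'NR > 1 { for (i = 2; i <= NF; i++) if ($i == "NA") { print; next } }' "$1"
}

list_na_rows "$tmpdir/sample.taxtable"
```

Comparing whole fields against "NA" avoids false hits on names that merely contain those letters.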
- Extract NCBI taxonomy IDs:
grep ">" mitofish/*.fa mitofish/duplicates/*.fa | awk -F "|" '{print $6}' >complete.full.taxIDs
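Since NCBI taxonomy IDs are integers, it may be worth confirming that the extraction produced only numeric lines before passing the file to R. The sketch below counts lines that are not purely numeric; the sample file and the `count_bad_ids` helper are hypothetical.

```shell
# Hypothetical sample standing in for complete.full.taxIDs:
# three valid IDs, one empty line, and one non-numeric entry
tmpdir=$(mktemp -d)
printf '8038\n8036\n\nNA\n79736\n' > "$tmpdir/sample.taxIDs"

# count_bad_ids (hypothetical helper): print the number of lines that are
# not made up solely of digits. grep exits non-zero when the count is 0,
# so "|| true" keeps the helper usable under "set -e".
count_bad_ids() {
  grep -Evc '^[0-9]+$' "$1" || true
}

count_bad_ids "$tmpdir/sample.taxIDs"
```

A non-zero count suggests that some FASTA headers do not carry the taxonomy ID in the sixth `|`-separated field.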
- Use the R taxonomizr package to map NCBI taxonomy IDs to taxonomic names. See below for the R code (not included in the repository):
library(taxonomizr)
# Mapping taxonomy IDs to taxonomic names
taxID=read.delim("complete.full.taxIDs",header=F)
taxIDs=taxID[,1]
taxa=getTaxonomy(taxIDs,'accessionTaxa.sql')
write.table(taxa,"complete.full.taxa.tsv",sep="\t")
- Contents of `complete.full.taxa.tsv`:
"superkingdom" "phylum" "class" "order" "family" "genus" "species"
" 8038" "Eukaryota" "Chordata" "Actinopteri" "Salmoniformes" "Salmonidae" "Salvelinus" "Salvelinus fontinalis"
" 8036" "Eukaryota" "Chordata" "Actinopteri" "Salmoniformes" "Salmonidae" "Salvelinus" "Salvelinus alpinus"
" 79736" "Eukaryota" "Chordata" "Chondrichthyes" "Carcharhiniformes" "Triakidae" "Mustelus" "Mustelus manazo"
" 386614" "Eukaryota" "Chordata" "Chondrichthyes" "Rajiformes" "Rajidae" "Amblyraja" "Amblyraja radiata"
- Remove quotes and fix the header of `complete.full.taxa.tsv`. The first column, which contains NCBI taxonomy IDs, will be named `taxid`:
cat complete.full.taxa.tsv | grep -v superkingdom | tr -d "\"" >complete.full.noheader.taxtable
cat tax.header complete.full.noheader.taxtable >complete.full.taxtable
- Contents of `complete.full.taxtable`:
taxid Superkingdom Phylum Class Order Family Genus Species
8038 Eukaryota Chordata Actinopteri Salmoniformes Salmonidae Salvelinus Salvelinus fontinalis
8036 Eukaryota Chordata Actinopteri Salmoniformes Salmonidae Salvelinus Salvelinus alpinus
79736 Eukaryota Chordata Chondrichthyes Carcharhiniformes Triakidae Mustelus Mustelus manazo