McLean K, Rosenthal CA, Sengupta D, Owens J, Cookson BT, Hoffman NG, Salipante SJ. 2019. Improved species-level clinical identification of Enterobacteriaceae through broad-range dnaJ PCR and sequencing. J Clin Microbiol 57:e00986-19.]
Chaperone DnaJ, also known as Hsp40 (heat shock protein 40 kD), is a molecular chaperone protein. It is expressed in a wide variety of organisms from bacteria to humans.
Question - Can we use the DnaJ dna region to help speciate organisms in the Enterobacteriaceae family and Escherichia and Shigella genera?
Here is an example of the kinds of Genbank records and annotations we are looking for in our pipeline
The first step is to construct an NCBI Esearch query for a list of accession numbers that may contain DnaJ features:
esearch -db nucleotide -query "Enterobacteriaceae[Organism] AND dnaj[All Fields]"
Which gives us 136,547 potential dnaJ records. The next step is to download the feature tables for all 136,547 records:
mefetch -db nucleotide -format ft -out feature_tables.txt
Besides the first example we looked at there may be many different ways dnaJ is annotated. To get an idea of what other kinds of annotations may correspond to the dnaJ region we can do something like this on the feature_tables.txt file:
grep -i 'dnaj' feature_table.txt | sort | uniq -c | sort -nr
Based on those counts we can parse coordinates from the following string patterns:
- product molecular chaperone DnaJ
- gene dnaJ
- product chaperone protein DnaJ
- product DnaJ protein
- product DnaJ domain-containing protein
- product dnaJ domain protein
- product chaperone dnaJ
- product DnaJ
- product dnaJ central domain protein
To parse the coordinates we can use ftract:
ftract --full-format -feature gene:gene:"dnaj" -feature CDS:product:"molecular chaperone DnaJ" -feature CDS:product:"chaperone protein DnaJ" -feature CDS:product:"DnaJ protein" -feature CDS:product:"DnaJ domain-containing protein" -feature CDS:product:"dnaJ domain protein" -feature CDS:product:"chaperone dnaJ" -feature CDS:product:"DnaJ" -feature CDS:product:"dnaJ central domain protein" feature_table.txt
Once we have each of these coordinates parsed let's do a quick record count:
Now we can download each dnaJ sequence by accession and coordinates:
mefetch -csv -id coordinates.csv -db nucleotide -format gbwithparts -out records.gb
This is a big file. The next steps involve converting the genbank records into a fasta file (seqs.fasta) and an annotations file (seq_info.csv).
After a deduplication step and selecting for named taxonomic records we end up with a total of 40,169 potential dnaJ records.
The next step is build an alignment profile to test sequence quality. For this process the first step is to select downloaded records that are designated type strain material.
"Type strain material is the taxonomic device that ties formal names to the physical specimens that serve as exemplars for the species. For the prokaryotes these are strains submitted to the culture collections; for the eukaryotes they are specimens submitted to museums or herbaria." - https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4383940/
And build a multiple alignment using muscle.
muscle -in seqs.fasta -out muscle.fasta
After creating the multiple alignment we can create a rough alignment profile using hmmer.
hmmbuild --dna rough.hmm muscle.fasta
Next we can search those same type strain sequences against the rough/type strain alignment profile:
hmmsearch -tblout hmmer.tsv rough.hmm seqs.fasta
It seems a little circular but what that does is allow me to side by side look at the muscle alignment and some alignment metrics and choose sequences that I like and sequences that I do not like.
And from there I can choose record LR134137_3722540_3722971
as my last accepted dnaJ sequence.
With that knowledge I can filter the type strains against the ordered hmmsearch results:
hmmer.py --last-in LR134137_3722540_3722971 seq_info.csv hmm_rough.tsv
And build another, higher confidence hmmer profile to filter the rest of the sequences:
hmmbuild --dna dnaj.hmm filtered.fasta
From there I can go thru the same hmmsearch and filter steps and go from 40,169 sequences to 25,450 high confidence sequences. From there a dnaJ alignment looks something like this:
- Figure out what is going on with the 14,719 dropped seqs.
- Turn this into our own NCBI dnaJ in-house reference database.