
PqsE_HhqE_ortholog_workflow.md


PqsE and HhqE orthologous sequences were downloaded from the Pseudomonas Genome DB (https://pseudomonas.com/orthologs/list?id=1653897) and the Burkholderia Genome DB (https://www.burkholderia.com/orthologs/list?id=21479956), producing the .csv files PqsE_ortholog.csv and HhqE_ortholog.csv.

Because the HhqE_ortholog.csv file contained a lot of unwanted symbols, I used regex find-and-replace to clean it up.

I searched for "," and replaced it with "\n", then searched for "(^B)" and replaced it with ">\1".
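The same two replacements can be sketched on the command line. This is only a sketch, not the editor-based method used here: the function name `csv_to_fasta` is hypothetical, and it assumes every header field starts with "B" and no sequence line does.

```shell
# csv_to_fasta (hypothetical helper): reads the CSV on stdin, writes FASTA-like text.
csv_to_fasta() {
    # 1) turn every comma into a newline
    # 2) prefix each line that starts with "B" with ">"
    tr ',' '\n' | sed 's/^\(B\)/>\1/'
}

# Example use (file name from the text above; output name is a placeholder):
# csv_to_fasta < HhqE_ortholog.csv > cleaned_output.txt
```

Any sequence line that happens to begin with "B" would also get a ">" prefix, so the result is worth a quick visual check.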

Some species names had "assembly" tacked onto the end. To get rid of that:

I searched for "\s-\sAssembly\s" and replaced it with "_".
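For reference, a rough sed equivalent of that find-and-replace is shown here. The function name `strip_assembly` is hypothetical, and POSIX character classes stand in for `\s` so it should behave the same in GNU and BSD sed.

```shell
# strip_assembly (hypothetical helper): reads text on stdin and replaces
# "<space>-<space>Assembly<space>" with a single underscore.
strip_assembly() {
    sed 's/[[:space:]]-[[:space:]]Assembly[[:space:]]/_/g'
}

# Example use (file names are placeholders):
# strip_assembly < input.txt > output.txt
```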

I repeated the same steps for PqsE_ortholog.csv.

The two cleaned files were renamed PqsE_ortholog.faa and HhqE_ortholog.faa and concatenated into one file, PqsE_HhqE_ortholog.faa:

cat PqsE_ortholog.faa HhqE_ortholog.faa > PqsE_HhqE_ortholog.faa
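A quick sanity check after concatenating is to count the records and look for repeated names, since duplicate sequence names can cause trouble in later steps. The helpers below are a sketch and were not part of the original workflow. The function names are hypothetical.

```shell
# count_seqs (hypothetical helper): number of FASTA header lines in a file.
count_seqs() {
    grep -c '^>' "$1"
}

# dup_headers (hypothetical helper): print each header that occurs more than once.
dup_headers() {
    grep '^>' "$1" | sort | uniq -d
}

# Example use:
# count_seqs PqsE_HhqE_ortholog.faa
# dup_headers PqsE_HhqE_ortholog.faa
```

The count for the combined file should equal the sum of the counts for the two input files, and `dup_headers` should print nothing.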

Some sequence names still had characters that were too messy for MUSCLE, so I replaced those symbols with underscores using the following sed commands:

sed -i 's/ /_/g' PqsE_HhqE_ortholog.faa
sed -i 's/-/_/g' PqsE_HhqE_ortholog.faa
sed -i 's/[.]/_/g' PqsE_HhqE_ortholog.faa
sed -i 's/#/_/g' PqsE_HhqE_ortholog.faa
sed -i 's/[/]/_/g' PqsE_HhqE_ortholog.faa
sed -i 's/__/_/g' PqsE_HhqE_ortholog.faa
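The commands above could also be folded into one pass. The sketch below is a variation, not what was actually run: it only touches header lines (those starting with ">"), it collapses runs of underscores of any length instead of only pairs, and it reads stdin instead of editing in place. The function name `clean_headers` is hypothetical.

```shell
# clean_headers (hypothetical helper): reads FASTA on stdin, writes cleaned FASTA.
clean_headers() {
    # On header lines only:
    #   1) replace space, ".", "#", "/" and "-" with "_"
    #   2) collapse any run of underscores into a single "_"
    sed -e '/^>/ s|[ .#/-]|_|g' -e '/^>/ s|__*|_|g'
}

# Example use (output name is a placeholder):
# clean_headers < PqsE_HhqE_ortholog.faa > cleaned.faa
```

Restricting the substitutions to header lines leaves the sequence lines untouched, which matters if the file ever contains gap characters.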

The sequences were aligned with MUSCLE:

muscle -in PqsE_HhqE_ortholog.faa -out PqsE_HhqE_ortholog_aln.faa

The alignment was trimmed with trimal:

trimal -in PqsE_HhqE_ortholog_aln.faa -out PqsE_HhqE_ortholog_aln_trm.faa -keepheader 

Note: the '-keepheader' option tells trimal to keep the original headers. I've had unfortunate times with trimal stealing my headers without it.
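One way to confirm that the names survived trimming is to compare the headers of the two files. This is a sketch with a hypothetical function name. It compares only the first whitespace-delimited field of each header, on the assumption that any trailing annotation may differ between the files.

```shell
# headers_match (hypothetical helper): succeeds if both FASTA files list the
# same sequence names (first field of each header) in the same order.
headers_match() {
    a=$(grep '^>' "$1" | awk '{print $1}')
    b=$(grep '^>' "$2" | awk '{print $1}')
    [ "$a" = "$b" ]
}

# Example use:
# headers_match PqsE_HhqE_ortholog_aln.faa PqsE_HhqE_ortholog_aln_trm.faa && echo "headers preserved"
```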

The tree was made with iqtree:

iqtree -s PqsE_HhqE_ortholog_aln_trm.faa -bb 1000 -m

The file PqsE_HhqE_ortholog_aln_trm.faa.treefile was uploaded to iTOL for visualization.