Introns
# intronerate.py
Given a completed run of reads_first.py for a sample, run this script to generate "gene" sequences for each locus. The script will generate two new sequence files for each gene:
supercontig: A sequence containing all assembled contigs with a unique alignment to the reference protein, concatenated into one sequence.
introns: The supercontig with the exon sequences removed.
python intronerate.py --prefix hybseq_directory
Specify the name of a directory generated by reads_first.py in the --prefix argument.
By default, the script reads the file genes_with_seqs.txt and recovers full-length sequences only for the genes where exons were previously recovered. To override this, you may optionally supply a file containing a list of genes with --genelist filename.
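When there are many sample directories, the invocation can be scripted. The following is a minimal sketch, not part of the pipeline itself: it only builds the command lines from the two options described above (--prefix and --genelist). The helper name `intronerate_command` and the sample directory names are hypothetical, and the script is assumed to be reachable at the path given in `script`.

```python
import shlex


def intronerate_command(prefix, genelist=None, script="intronerate.py"):
    """Build the argument list for one sample directory.

    prefix   -- directory generated by reads_first.py
    genelist -- optional file containing a list of genes
    script   -- path to intronerate.py (assumed location; adjust as needed)
    """
    cmd = ["python", script, "--prefix", prefix]
    if genelist is not None:
        cmd += ["--genelist", genelist]
    return cmd


# Hypothetical sample directories, each produced by reads_first.py.
samples = ["sampleA", "sampleB", "sampleC"]
commands = [intronerate_command(s) for s in samples]

for cmd in commands:
    # Print the command; to run it instead, use subprocess.run(cmd, check=True).
    print(shlex.join(cmd))
```

Printing the commands first makes it easy to check them before anything is executed.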
NOTE: The script extracts all sequence NOT annotated as exons by exonerate. This may be intron (or intergenic) sequence, but it may also come from mis-assembled contigs. It may be difficult to tell whether the sequence is "real" from a single sample, so I recommend running intronerate.py on several samples. Then, extract the supercontig sequences with retrieve_sequences.py and align them. Sequences that appear in only one sample are probably from mis-assembled contigs and may be trimmed, for example using trimAl.
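To get a rough idea of how much of an alignment is supported by only one sample, the columns can be counted directly. The sketch below is an illustration only and is not how trimAl works: it assumes the aligned supercontigs are in FASTA format with `-` or `?` for gaps, and it flags every column where exactly one sample has a residue. The function names and the toy alignment are hypothetical.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a {name: sequence} dict."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def single_sample_columns(alignment, gap_chars="-?"):
    """Return 0-based indices of columns where exactly one sequence is not a gap."""
    seqs = list(alignment.values())
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to the same length")
    flagged = []
    for i in range(length):
        occupied = sum(1 for s in seqs if s[i] not in gap_chars)
        if occupied == 1:
            flagged.append(i)
    return flagged


# Toy alignment: sampleC has a unique 2 bp insertion, sampleA a unique 4 bp tail.
example = """>sampleA
ATGC--GGTTAA
>sampleB
ATGC--GG----
>sampleC
ATGCCCGG----
"""

aln = read_fasta(example)
print(single_sample_columns(aln))  # [4, 5, 8, 9, 10, 11]
```

Flagged columns are only candidates for trimming. A column can be unique to one sample for legitimate reasons, such as a real insertion or a longer recovered flank, so it is worth inspecting the alignment before removing anything.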