Matt Johnson edited this page May 2, 2016 · 17 revisions

# intronerate.py

Given a completed run of reads_first.py for a sample, run this script to generate "gene" sequences for each locus. The script will generate two new sequence files for each gene:

supercontig: A sequence containing all assembled contigs with a unique alignment to the reference protein, concatenated into one sequence.

introns: The supercontig with the exon sequences removed.

python intronerate.py --prefix hybseq_directory

Specify the name of a directory generated by reads_first.py in the prefix argument.

By default, the script reads the file genes_with_seqs.txt and recovers full-length sequences only for the genes where exons were previously recovered. To use a different set of genes, supply a file containing a list of genes with --genelist filename
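As an illustration of the two invocation styles described above, here is a minimal Python sketch that assembles the command line with or without a gene list. The helper name `build_command` is hypothetical and not part of the script; only the `--prefix` and `--genelist` arguments come from this page.

```python
def build_command(prefix, genelist=None, script="intronerate.py"):
    """Assemble an intronerate.py command line as a list of arguments.

    `build_command` is a hypothetical helper for illustration only.
    `prefix` is a directory generated by reads_first.py; `genelist` is an
    optional file containing a list of genes.
    """
    cmd = ["python", script, "--prefix", str(prefix)]
    if genelist is not None:
        # Only add --genelist when the caller supplies a file; otherwise
        # the script falls back to genes_with_seqs.txt.
        cmd += ["--genelist", str(genelist)]
    return cmd


if __name__ == "__main__":
    # Default behavior: genes are taken from genes_with_seqs.txt.
    print(" ".join(build_command("hybseq_directory")))
    # Custom gene list (the file name here is a placeholder).
    print(" ".join(build_command("hybseq_directory", genelist="my_genes.txt")))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the directory where the script and the sample directory live.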

NOTE: The script will extract all sequence NOT annotated as exons by exonerate. This may be introns (or intergenic sequence), but it may also be mis-assembled contigs. While it may be difficult to tell whether the sequence is "real" from a single sample, I recommend running intronerate.py on several samples. Then, extract the supercontig sequences with retrieve_sequences.py and align them. Sequences that appear in only one sample are probably from mis-assembled contigs and may be trimmed, for example using Trimal.
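To make the "appears in only one sample" idea concrete, here is a rough Python sketch that flags alignment columns occupied by a single sample. It is not how Trimal works and is not part of HybPiper; the function name `single_sample_columns`, the toy alignment, and the choice of gap characters are all assumptions made for illustration.

```python
def single_sample_columns(alignment, gap_chars="-?"):
    """Return 0-based indices of columns where exactly one sample has a residue.

    `alignment` maps sample name -> aligned sequence (all the same length).
    Hypothetical helper for illustration; the gap characters are an assumption.
    """
    seqs = list(alignment.values())
    if not seqs:
        return []
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all aligned sequences must have the same length")
    flagged = []
    for col in range(length):
        # Count the samples that have a non-gap character in this column.
        occupied = sum(1 for s in seqs if s[col] not in gap_chars)
        if occupied == 1:
            flagged.append(col)
    return flagged


# Toy supercontig alignment for three samples (made-up sequences).
toy = {
    "sample1": "ATGC--GGTTAA",
    "sample2": "ATGC----TTAA",
    "sample3": "ATGCCC--TTAA",
}

if __name__ == "__main__":
    print(single_sample_columns(toy))
```

Columns flagged this way are candidates for trimming; a dedicated alignment trimmer is likely a better choice for real data.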
