Use pyfaidx to access contigs in FASTA files #64

fedarko · 2022-11-07T02:12:07Z

A few of the commands need to access specific contig sequences while iterating over all contigs—for example, we go through all contigs → for those that match some condition, retrieve the contig sequence → do some more stuff using the sequence.

When there are thousands of contigs, I think this process will become slow (or at least slower than needed) because we retrieve sequences using fasta_utils.get_single_seq()—which in turn goes through, in the worst case, every sequence in the FASTA file. Since this is done during an iteration over all contigs, it's one of those O(|contigs|^2) situations.

It is possible to speed this up by indexing the FASTA file to allow for "random access," which should cut the runtime of accessing a given contig down to ~O(1) (and thus cut the total runtime down to O(|contigs|). It would probably make sense to just use an existing library that supports indexing / random access in FASTA files (rather than re-inventing the wheel); pyfaidx seems good. (We currently read FASTA files using scikit-bio, and we can probably keep that around for most things, but I don't think it supports indexing / random access.)

The text was updated successfully, but these errors were encountered:

fedarko added the performance gotta go fast label Nov 7, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use pyfaidx to access contigs in FASTA files #64

Use pyfaidx to access contigs in FASTA files #64

fedarko commented Nov 7, 2022

Use pyfaidx to access contigs in FASTA files #64

Use pyfaidx to access contigs in FASTA files #64

Comments

fedarko commented Nov 7, 2022