You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
A few of the commands need to access specific contig sequences while iterating over all contigs—for example, we go through all contigs → for those that match some condition, retrieve the contig sequence → do some more stuff using the sequence.
When there are thousands of contigs, I think this process will become slow (or at least slower than needed) because we retrieve sequences using fasta_utils.get_single_seq()—which in turn goes through, in the worst case, every sequence in the FASTA file. Since this is done during an iteration over all contigs, it's one of those O(|contigs|^2) situations.
It is possible to speed this up by indexing the FASTA file to allow for "random access," which should cut the runtime of accessing a given contig down to ~O(1) (and thus cut the total runtime down to O(|contigs|). It would probably make sense to just use an existing library that supports indexing / random access in FASTA files (rather than re-inventing the wheel); pyfaidx seems good. (We currently read FASTA files using scikit-bio, and we can probably keep that around for most things, but I don't think it supports indexing / random access.)
The text was updated successfully, but these errors were encountered:
A few of the commands need to access specific contig sequences while iterating over all contigs—for example, we go through all contigs → for those that match some condition, retrieve the contig sequence → do some more stuff using the sequence.
When there are thousands of contigs, I think this process will become slow (or at least slower than needed) because we retrieve sequences using
fasta_utils.get_single_seq()
—which in turn goes through, in the worst case, every sequence in the FASTA file. Since this is done during an iteration over all contigs, it's one of thoseO(|contigs|^2)
situations.It is possible to speed this up by indexing the FASTA file to allow for "random access," which should cut the runtime of accessing a given contig down to ~
O(1)
(and thus cut the total runtime down toO(|contigs|)
. It would probably make sense to just use an existing library that supports indexing / random access in FASTA files (rather than re-inventing the wheel); pyfaidx seems good. (We currently read FASTA files using scikit-bio, and we can probably keep that around for most things, but I don't think it supports indexing / random access.)The text was updated successfully, but these errors were encountered: