Use pyfaidx to access contigs in FASTA files #64

Open
fedarko opened this issue Nov 7, 2022 · 0 comments
Labels
performance gotta go fast

Comments

Owner

fedarko commented Nov 7, 2022

A few of the commands need to access specific contig sequences while iterating over all contigs—for example, we go through all contigs → for those that match some condition, retrieve the contig sequence → do some more stuff using the sequence.

When there are thousands of contigs, I think this process will become slow (or at least slower than needed) because we retrieve sequences using fasta_utils.get_single_seq()—which in turn goes through, in the worst case, every sequence in the FASTA file. Since this is done during an iteration over all contigs, it's one of those O(|contigs|^2) situations.
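To make the quadratic cost concrete, here is a minimal sketch. `scan_for_seq` is a hypothetical stand-in for a linear-scan lookup, not the project's actual `fasta_utils.get_single_seq()`, and the contig names are made up. It counts how many records get examined when every contig is looked up once.

```python
import io


def scan_for_seq(fasta_text, wanted):
    """Linear-scan lookup (hypothetical; for illustration only).

    Returns (sequence, number_of_records_examined).
    """
    examined = 0
    name = None
    chunks = []
    for line in io.StringIO(fasta_text):
        line = line.strip()
        if line.startswith(">"):
            # Reaching the next header means the wanted record is complete.
            if name == wanted:
                return "".join(chunks), examined
            name = line[1:].split()[0]
            chunks = []
            examined += 1
        elif name == wanted:
            chunks.append(line)
    if name == wanted:
        return "".join(chunks), examined
    raise KeyError(wanted)


# Toy FASTA with 100 contigs (made-up names and sequences).
names = [f"contig{i}" for i in range(100)]
fasta_text = "".join(f">{n}\nACGT\n" for n in names)

# Looking up every contig once examines 1 + 2 + ... + 100 records in total.
total_examined = sum(scan_for_seq(fasta_text, n)[1] for n in names)
print(total_examined)
```

With n contigs the total is n(n + 1) / 2 records examined, which is where the O(|contigs|^2) comes from.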

It is possible to speed this up by indexing the FASTA file to allow for "random access," which should cut the runtime of accessing a given contig down to ~O(1) (and thus cut the total runtime down to O(|contigs|), plus the one-time cost of building the index). It would probably make sense to just use an existing library that supports indexing / random access in FASTA files (rather than re-inventing the wheel); pyfaidx seems good. (We currently read FASTA files using scikit-bio, and we can probably keep that around for most things, but I don't think it supports indexing / random access.)
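As a rough illustration of what "indexing" buys us, here is a standard-library-only sketch of a byte-offset index. It is not pyfaidx and not how this project does anything today; `build_offset_index`, `fetch`, and the file contents are all made up for the example. The file is read once to record where each record's sequence starts and ends, and after that each lookup is a single seek and read.

```python
import os
import tempfile


def build_offset_index(path):
    """Map each record name to [start, end] byte offsets of its sequence lines."""
    index = {}
    current = None
    with open(path, "rb") as handle:
        while True:
            line_start = handle.tell()
            line = handle.readline()
            if not line:
                break
            if line.startswith(b">"):
                # The previous record's sequence ends where this header begins.
                if current is not None:
                    index[current][1] = line_start
                current = line[1:].split()[0].decode()
                index[current] = [handle.tell(), None]
        # The last record runs to the end of the file.
        if current is not None:
            index[current][1] = handle.tell()
    return index


def fetch(path, index, name):
    """Seek straight to one record and return its sequence without newlines."""
    start, end = index[name]
    with open(path, "rb") as handle:
        handle.seek(start)
        raw = handle.read(end - start)
    return b"".join(raw.split()).decode()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "contigs.fasta")
    with open(path, "w") as out:
        out.write(">edge_1 first\nACGT\nAC\n>edge_2\nGGTTA\n")
    index = build_offset_index(path)  # one pass over the file
    seq1 = fetch(path, index, "edge_1")  # no rescanning from the top
    seq2 = fetch(path, index, "edge_2")

print(seq1, seq2)
```

pyfaidx does this bookkeeping for us (it writes a `.fai` index next to the FASTA file), so in practice the lookup would be something like `Fasta(path)[name]` rather than hand-rolled code like the above.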

fedarko added the performance gotta go fast label Nov 7, 2022