FlatGFA: Optimize GFA parsing a bit #153

Merged
merged 8 commits into from
Mar 16, 2024
@sampsyo sampsyo commented Mar 16, 2024

A bunch of little optimizations guided by some profiling, all for the parsing part of polbin.

I used two human pangenome GFAs for the measurements, running on havarti. The times below are for converting GFA -> FlatGFA:

|                   | chr22  | chr8   |
|-------------------|--------|--------|
| original GFA size | 2.4 GB | 3.9 GB |
| FlatGFA size      | 1.5 GB | 2.1 GB |
| before time       | 28s    | 49s    |
| after time        | 13s    | 18s    |

So that's a 2.2x and 2.7x speedup for the two input graphs, respectively.

Optimizations included:

  • Getting rid of some collects to avoid allocating vectors.
  • Replacing usize IDs with u32 IDs.
  • The big one: optimizing for the (apparently common) case when segment names are sequential numbers, avoiding a hash table that was previously required to look up IDs by name.
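To illustrate the last item, here is a minimal sketch of how a name-to-ID map could special-case sequential numeric names. This is not necessarily the code in this PR: the `NameMap` type, its fields, and the assumption that names start at 1 and that IDs are assigned in insertion order starting at 0 are all hypothetical, chosen just to show the idea.

```rust
use std::collections::HashMap;

/// Hypothetical sketch: maps numeric segment names to dense u32 IDs.
/// Names that arrive as 1, 2, 3, ... with IDs 0, 1, 2, ... are tracked
/// with a single counter; anything else falls back to a hash table.
#[derive(Default)]
pub struct NameMap {
    /// Names 1..=sequential_max map implicitly to IDs 0..sequential_max.
    sequential_max: usize,
    /// Fallback for names that break the sequential pattern.
    others: HashMap<usize, u32>,
}

impl NameMap {
    /// Record that the segment called `name` has the given `id`.
    pub fn insert(&mut self, name: usize, id: u32) {
        // Fast path: the next name in the run, with the ID we would
        // have computed anyway. No hash table entry is needed.
        if name == self.sequential_max + 1 && (id as usize) == name - 1 {
            self.sequential_max += 1;
        } else {
            self.others.insert(name, id);
        }
    }

    /// Look up the ID for a segment name.
    pub fn get(&self, name: usize) -> Option<u32> {
        if name >= 1 && name <= self.sequential_max {
            // Sequential case: the ID is just arithmetic on the name.
            Some((name - 1) as u32)
        } else {
            self.others.get(&name).copied()
        }
    }
}
```

With a map like this, a graph whose segments are named 1 through N never touches the hash table at all, and an out-of-order name such as 10 inserted after 1 and 2 just lands in the fallback table.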

Next steps would be:

  • Roll my own (regex-free) GFA parser.
  • Avoid the memcpy stage by pre-allocating big slabs of memory and parsing directly into there. Requires estimating the sizes of things, which seems hard?
  • Something about how weirdly large the "path steps" parser looms in the time profile??
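As a rough idea of what the regex-free parser in the first item might look like, here is a sketch that handles only GFA segment (`S`) lines by splitting on tabs. The `SegmentLine` type and the `parse_segment` and `parse_usize` functions are hypothetical names for illustration; the sketch also assumes numeric segment names and ignores any optional tag fields after the sequence.

```rust
/// Hypothetical sketch: one parsed GFA segment ("S") line.
/// The sequence borrows from the input, so nothing is copied.
#[derive(Debug, PartialEq)]
pub struct SegmentLine<'a> {
    pub name: usize,
    pub seq: &'a [u8],
}

/// Parse a line of the form `S<TAB>name<TAB>sequence[<TAB>tags...]`.
/// Returns None for other line types, missing fields, or
/// non-numeric names. Optional tags after the sequence are ignored.
pub fn parse_segment(line: &[u8]) -> Option<SegmentLine<'_>> {
    let mut fields = line.split(|&b| b == b'\t');
    if fields.next()? != &b"S"[..] {
        return None;
    }
    let name = parse_usize(fields.next()?)?;
    let seq = fields.next()?;
    Some(SegmentLine { name, seq })
}

/// Parse an ASCII decimal number without going through a string type.
/// Returns None on empty input, non-digit bytes, or overflow.
fn parse_usize(bytes: &[u8]) -> Option<usize> {
    if bytes.is_empty() {
        return None;
    }
    let mut n: usize = 0;
    for &b in bytes {
        if !b.is_ascii_digit() {
            return None;
        }
        n = n.checked_mul(10)?.checked_add((b - b'0') as usize)?;
    }
    Some(n)
}
```

Working directly on byte slices like this avoids both the regex engine and UTF-8 validation. A real parser would also need the other line types (links, paths, headers) and non-numeric names.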

@sampsyo sampsyo merged commit 3bffe99 into main Mar 16, 2024
3 checks passed
@sampsyo sampsyo deleted the polbin-parse-opt branch March 16, 2024 22:01