FlatGFA: Optimize GFA parsing a bit #153

sampsyo · 2024-03-16T20:51:55Z

A bunch of little optimizations guided by some profiling, all for the parsing part of polbin.

I used two human pangenome GFAs to measure stuff. Measured on havarti (reporting times to convert GFA -> FlatGFA):

So that's a 2.2x and 2.7x speedup for the two input graphs, respectively.

Optimizations included:

Getting rid of some collects to avoid allocating vectors.
Replacing usize IDs with u32 IDs.
The big one: optimizing for the (apparently common) case when segment names are sequential numbers, avoiding a hash table that was previously required to look up IDs by name.
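For illustration, the sequential-names fast path could look roughly like the sketch below. This is not the code from this PR: the `NameMap` type, its fields, and its methods are hypothetical names, and the sketch assumes that segment names are unique integers and that the common case numbers them 1, 2, 3, ... in the order the segments are added. Names that fit the pattern need no hash table entry at all; anything else falls back to a `HashMap`.

```rust
use std::collections::HashMap;

/// Hypothetical map from numeric segment names to u32 segment IDs.
/// Assumption: names are unique, and the common case is names 1, 2, 3, ...
/// assigned to IDs 0, 1, 2, ... in order.
#[derive(Default)]
struct NameMap {
    /// Names 1..=dense_max map to IDs 0..dense_max with no table lookup.
    dense_max: u32,
    /// Fallback for every name that broke the sequential pattern.
    sparse: HashMap<usize, u32>,
}

impl NameMap {
    /// Record that the segment called `name` has index `id`.
    fn insert(&mut self, name: usize, id: u32) {
        // Fast path: the name is the next one in sequence, and its ID is
        // exactly one less than the name, so it can be computed on lookup.
        if name == self.dense_max as usize + 1 && id == self.dense_max {
            self.dense_max += 1;
        } else {
            self.sparse.insert(name, id);
        }
    }

    /// Look up the ID for `name`, if it has been inserted.
    fn get(&self, name: usize) -> Option<u32> {
        if name >= 1 && name <= self.dense_max as usize {
            // Dense range: plain arithmetic, no hashing.
            Some((name - 1) as u32)
        } else {
            self.sparse.get(&name).copied()
        }
    }
}
```

With a fully sequential input, `sparse` stays empty and every lookup is a range check plus a subtraction. Storing IDs as `u32` rather than `usize` also halves the size of each stored index on 64-bit machines.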

Next steps would be:

Roll my own (regex-free) GFA parser.
Avoid the memcpy stage by pre-allocating big slabs of memory and parsing directly into there. Requires estimating the sizes of things, which seems hard?
Something about how weirdly large the "path steps" parser looms in the time profile??
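As a rough idea of what a regex-free, allocation-free steps parser might look like, here is a minimal sketch. It is not the parser in this repository: `StepsParser` is a hypothetical name, and the sketch assumes the step field is a comma-separated list such as `1+,2-,14+` with purely numeric segment names. It yields steps lazily, so the caller never has to collect them into an intermediate vector.

```rust
/// Hypothetical iterator over the step field of a GFA path line,
/// e.g. `b"1+,2-,14+"`. Yields `(segment name, is_forward)` pairs.
/// Assumption: segment names are decimal integers that fit in a usize
/// (this sketch does not guard against overflow).
struct StepsParser<'a> {
    rest: &'a [u8],
}

impl<'a> StepsParser<'a> {
    fn new(field: &'a [u8]) -> Self {
        Self { rest: field }
    }
}

impl<'a> Iterator for StepsParser<'a> {
    type Item = (usize, bool);

    fn next(&mut self) -> Option<Self::Item> {
        if self.rest.is_empty() {
            return None;
        }

        // Accumulate the decimal digits of the segment name.
        let mut name = 0usize;
        let mut i = 0;
        while i < self.rest.len() && self.rest[i].is_ascii_digit() {
            name = name * 10 + (self.rest[i] - b'0') as usize;
            i += 1;
        }

        // A step needs at least one digit followed by `+` or `-`.
        // On malformed input, this sketch simply stops iterating.
        let forward = match (i, self.rest.get(i)) {
            (0, _) => {
                self.rest = &[];
                return None;
            }
            (_, Some(b'+')) => true,
            (_, Some(b'-')) => false,
            _ => {
                self.rest = &[];
                return None;
            }
        };
        i += 1;

        // Skip the comma separating this step from the next, if present.
        if self.rest.get(i) == Some(&b',') {
            i += 1;
        }
        self.rest = &self.rest[i..];
        Some((name, forward))
    }
}
```

A caller can write `for (name, forward) in StepsParser::new(field) { ... }` and push each step straight into its final storage. A real parser would report malformed input as an error instead of silently ending the iteration.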

sampsyo added 8 commits March 16, 2024 14:14

Remove length parameter for extend

9e2023b

Not sure why I thought this was necessary?

Do not collect steps

648a731

Avoid collecting overlaps

b58f1cc

Indices are u32

69d607c

This ought to be enough for anybody!!

Refactor parser

2db8269

Optimize for sequential IDs

ba7d0bf

Refactor dense name map thing

091ad17

Use proper iterator for steps parser

ceaad0c

sampsyo merged commit 3bffe99 into main Mar 16, 2024
3 checks passed

sampsyo deleted the polbin-parse-opt branch March 16, 2024 22:01

This was referenced Mar 17, 2024

FlatGFA: Hand-rolled GFA parser #154

Merged

FlatGFA: Load the GFA text from a file directly #158

Merged
