hal2fasta can be very slow #81

Open
diekhans opened this issue May 14, 2019 · 1 comment
diekhans commented May 14, 2019

From a user:

Just to follow up on today's meeting on a couple of points.  First, here
is a test case for the extremely long extraction times for ranges on some
of the nodes in the 200Mammal alignment.  I have a script that simply
reads ranges from a file, calls hal2fasta, and times the run.  I ran it
on a node which goes quickly overall and one that is extremely slow.  Both
runs extract a similar amount of bp over 1000 ranges each:

Example normal run:

./benchExtract.pl 200Mammals/200m-v1.hal fullTreeAnc208 fullTreeAnc208.bed
Total Ranges: 1000
Total Sequence: 350457 bp
Average hal2fasta extraction time: 15946.44 bp/sec
                                   0.02 records/sec
Total runtime: 22.200921 secs



Example slow run:

./benchExtract.pl 200Mammals/200m-v1.hal Acomys_cahirinus Aconmys_cahirinus.bed
Total Ranges: 1000
Total Sequence: 353557 bp
Average hal2fasta extraction time: 1129.61 bp/sec
                                   0.31 records/sec
Total runtime: 312.902046 secs
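
The benchExtract.pl script itself is not shown in this thread. As a rough illustration of the workflow described above (one hal2fasta process per range, timed overall), here is a minimal Python sketch. It assumes the range file is BED (0-based start, end-exclusive) and that hal2fasta accepts `--sequence`, `--start` and `--length` and writes FASTA to stdout by default; the function names are made up for this example and are not part of HAL.

```python
import subprocess
import sys
import time


def read_bed(path):
    """Yield (sequence, start, length) from the first three BED columns."""
    with open(path) as handle:
        for line in handle:
            if not line.strip() or line.startswith("#"):
                continue  # skip blank lines and comments
            seq, start, end = line.split()[:3]
            yield seq, int(start), int(end) - int(start)


def hal2fasta_cmd(hal_path, genome, seq, start, length):
    """Build one hal2fasta call for a single range (flags are assumptions)."""
    return ["hal2fasta", hal_path, genome,
            "--sequence", seq, "--start", str(start), "--length", str(length)]


def bench(hal_path, genome, bed_path):
    """Run hal2fasta once per range and return (ranges, bp, seconds)."""
    count = 0
    total_bp = 0
    began = time.perf_counter()
    for seq, start, length in read_bed(bed_path):
        # Each call starts a new process, so the genome is reloaded every time.
        subprocess.run(hal2fasta_cmd(hal_path, genome, seq, start, length),
                       stdout=subprocess.DEVNULL, check=True)
        count += 1
        total_bp += length
    return count, total_bp, time.perf_counter() - began


if __name__ == "__main__" and len(sys.argv) == 4:
    ranges, bp, secs = bench(sys.argv[1], sys.argv[2], sys.argv[3])
    print(f"Total Ranges: {ranges}")
    print(f"Total Sequence: {bp} bp")
    print(f"Total runtime: {secs:.6f} secs")
```

Because every range spawns a fresh process, any fixed cost of opening the genome is paid 1000 times in a run like the ones above.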

@diekhans

From Joel

Interesting -- sorry that HAL is being a bit weird. I can reproduce the
same results on my machine. If I had to take a guess, it might be related
to the number of contigs in the genome. The 200M assemblies have a ton of
contigs, and the HDF5 version of HAL has to do a linear amount of work (in
the number of contigs) when loading the genome so it can understand how
things are laid out.
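
To put a number on that guess: both runs extract roughly the same amount of sequence, so dividing the total runtime by the number of hal2fasta calls gives an approximate cost per invocation. The short calculation below uses only the figures reported in the two runs above. Attributing the gap to genome loading is the hypothesis from this comment, not something the arithmetic proves.

```python
# Figures copied from the two benchExtract.pl runs reported in this issue.
runs = {
    "fullTreeAnc208": {"ranges": 1000, "bp": 350457, "seconds": 22.200921},
    "Acomys_cahirinus": {"ranges": 1000, "bp": 353557, "seconds": 312.902046},
}


def seconds_per_call(run):
    """Average wall-clock time of one hal2fasta invocation."""
    return run["seconds"] / run["ranges"]


for name, run in runs.items():
    print(f"{name}: {seconds_per_call(run):.3f} s per hal2fasta call")

slowdown = (runs["Acomys_cahirinus"]["seconds"]
            / runs["fullTreeAnc208"]["seconds"])
print(f"slowdown: {slowdown:.1f}x")
# prints:
# fullTreeAnc208: 0.022 s per hal2fasta call
# Acomys_cahirinus: 0.313 s per hal2fasta call
# slowdown: 14.1x
```

Each range is only about 350 bp on average, so a per-call cost of around 0.3 s is consistent with a fixed setup cost dominating the slow genome.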

You can try adding --cacheBytes 0 to your hal2fasta command; this might
speed things up a bit, though there will probably still be a discrepancy in
runtime between the two genomes.

Mark (cc'd) & I developed a new backend format for HAL, out of our deep
frustration with the HDF5 library. It's much faster for some operations,
but the drawback is it takes quite a bit more space (3TB for the 200M), so
we shared the usual HDF5 version. If you are doing this type of thing a
lot, and it's a major blocker, you may want to use that format instead? I
get (roughly) the same runtime for both genomes using that format:

$ ./benchExtract.pl /mnt2/200m-v1-mmap.hal Acomys_cahirinus Acomys_cahirinus.bed
Total Ranges: 1000
Total Sequence: 353557 bp
Average hal2fasta extraction time: 13041.86 bp/sec
0.03 records/sec
Total runtime: 28.413118 secs
$ ./benchExtract.pl /mnt2/200m-v1-mmap.hal fullTreeAnc208 fullTreeAnc208.bed
Total Ranges: 1000
Total Sequence: 350457 bp
Average hal2fasta extraction time: 17966.64 bp/sec
0.02 records/sec
Total runtime: 23.405557 secs

You could convert to the new (mmap-based) format using "halExtract
--outputFormat mmap --mmapFileSize 3250 ". The 200M
takes about a day to convert.

Alternatively, we could put in a BED input option to hal2fasta, so that the
sequence data structures would only need to be loaded once, which would
basically fix this particular pain point. But, being realistic about our
bandwidth here, that might take over a week to get around to.
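
The idea behind the proposed BED option can be sketched with a toy model. This is not HAL code: `ToyGenome` is a hypothetical stand-in whose `load()` method represents the expensive, linear-in-contigs setup, and the two functions contrast paying that cost once per range with paying it once per batch.

```python
class ToyGenome:
    """Hypothetical stand-in for a HAL genome; counts expensive loads."""

    def __init__(self, sequences):
        self._sequences = sequences
        self.loads = 0

    def load(self):
        # Models the per-open setup work that grows with the contig count.
        self.loads += 1
        return self._sequences


def extract_per_call(genome, ranges):
    """Current behaviour: one load per range (one process per range)."""
    out = []
    for name, start, end in ranges:
        seqs = genome.load()
        out.append(seqs[name][start:end])
    return out


def extract_batched(genome, ranges):
    """Proposed behaviour: load once, then serve every range."""
    seqs = genome.load()
    return [seqs[name][start:end] for name, start, end in ranges]


ranges = [("contig1", 0, 4), ("contig2", 2, 6), ("contig1", 4, 8)]
data = {"contig1": "ACGTACGTAC", "contig2": "TTGGCCAATT"}

slow, fast = ToyGenome(data), ToyGenome(data)
assert extract_per_call(slow, ranges) == extract_batched(fast, ranges)
print(slow.loads, fast.loads)  # prints: 3 1
```

With 1000 ranges the batched variant would pay the setup cost once instead of 1000 times, which is why a BED input option would address this particular slowdown.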
