Just to follow up on a couple of points from today's meeting. First, here
is a test case for the extremely long extraction times for ranges on some
of the nodes in the 200 Mammals alignment. I have a script that simply
reads ranges from a file, calls hal2fasta on each, and times the run. I ran
it on one node that goes quickly overall and on one that is extremely slow.
Both runs extract a similar amount of sequence (~350 kb) over 1000 ranges each:
Example normal run:
./benchExtract.pl 200Mammals/200m-v1.hal fullTreeAnc208 fullTreeAnc208.bed
Total Ranges: 1000
Total Sequence: 350457 bp
Average hal2fasta extraction time: 15946.44 bp/sec
0.02 records/sec
Total runtime: 22.200921 secs
Example slow run:
./benchExtract.pl 200Mammals/200m-v1.hal Acomys_cahirinus Acomys_cahirinus.bed
Total Ranges: 1000
Total Sequence: 353557 bp
Average hal2fasta extraction time: 1129.61 bp/sec
0.31 records/sec
Total runtime: 312.902046 secs
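For reference, the per-range loop in benchExtract.pl can be sketched in Python (a hypothetical reimplementation, not the actual Perl script; it assumes standard 3-column BED input and hal2fasta's --sequence/--start/--length options):

```python
#!/usr/bin/env python3
"""Sketch of the range-extraction benchmark: one hal2fasta call per BED line."""
import subprocess
import sys
import time

def bench(hal_path, genome, bed_path, hal2fasta="hal2fasta"):
    """Extract each BED range with a separate hal2fasta launch; return stats."""
    total_bp = 0
    n_ranges = 0
    t0 = time.monotonic()
    with open(bed_path) as bed:
        for line in bed:
            chrom, start, end = line.split()[:3]
            length = int(end) - int(start)
            # One process launch per range, as in the Perl script.
            subprocess.run(
                [hal2fasta, hal_path, genome,
                 "--sequence", chrom,
                 "--start", start,
                 "--length", str(length)],
                check=True, stdout=subprocess.DEVNULL)
            total_bp += length
            n_ranges += 1
    elapsed = time.monotonic() - t0
    return {"ranges": n_ranges, "bp": total_bp, "secs": elapsed,
            "bp_per_sec": total_bp / elapsed}

if __name__ == "__main__":
    print(bench(*sys.argv[1:4]))
```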
Interesting -- sorry that HAL is being a bit weird. I can reproduce the
same results on my machine. If I had to guess, it's related to the number
of contigs in the genome. The 200M assemblies have a ton of contigs, and
the HDF5 version of HAL has to do a linear amount of work (in the number
of contigs) when loading the genome so it can understand how things are
laid out.
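Since both runs extract a similar amount of sequence over the same 1000 ranges, the gap is almost entirely a fixed per-invocation cost. A quick back-of-envelope from the totals reported above:

```python
# Per-range cost implied by the two timed runs above (1000 ranges each).
ranges = 1000
fast_total = 22.200921    # fullTreeAnc208, seconds
slow_total = 312.902046   # Acomys_cahirinus, seconds

fast_per_range = fast_total / ranges   # ~0.022 s per hal2fasta call
slow_per_range = slow_total / ranges   # ~0.313 s per hal2fasta call

# Extra fixed cost per invocation -- consistent with per-launch genome
# loading (e.g. work linear in contig count), not with extraction itself.
overhead = slow_per_range - fast_per_range
print(f"{overhead:.3f} s extra per call")  # prints "0.291 s extra per call"
```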
You can try adding --cacheBytes 0 to your hal2fasta command; this might
speed things up a bit, though there will probably still be a discrepancy in
runtime between the two genomes.
Mark (cc'd) and I developed a new backend format for HAL out of our deep
frustration with the HDF5 library. It's much faster for some operations,
but the drawback is that it takes quite a bit more space (3 TB for the
200M), so we shared the usual HDF5 version instead. If you are doing this
type of thing a lot and it's a major blocker, you may want to use that
format. I get roughly the same runtime for both genomes with it:
$ ./benchExtract.pl /mnt2/200m-v1-mmap.hal Acomys_cahirinus Acomys_cahirinus.bed
Total Ranges: 1000
Total Sequence: 353557 bp
Average hal2fasta extraction time: 13041.86 bp/sec
0.03 records/sec
Total runtime: 28.413118 secs
$ ./benchExtract.pl /mnt2/200m-v1-mmap.hal fullTreeAnc208 fullTreeAnc208.bed
Total Ranges: 1000
Total Sequence: 350457 bp
Average hal2fasta extraction time: 17966.64 bp/sec
0.02 records/sec
Total runtime: 23.405557 secs
You could convert to the new (mmap-based) format using "halExtract
--outputFormat mmap --mmapFileSize 3250 …". The 200M takes about a day
to convert.
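If it helps to script that, here is a minimal sketch that just assembles the halExtract argv; the input/output paths are placeholders, and the size_gb=3250 default assumes --mmapFileSize is in gigabytes, which would match the ~3 TB footprint mentioned above:

```python
def mmap_convert_cmd(in_hal, out_hal, size_gb=3250):
    """argv for converting an HDF5 HAL file to the mmap backend.

    Paths are placeholders; size_gb assumes --mmapFileSize is in
    gigabytes (matching the ~3 TB footprint mentioned above).
    """
    return ["halExtract", "--outputFormat", "mmap",
            "--mmapFileSize", str(size_gb), in_hal, out_hal]

# Usage (expect roughly a day of runtime for the 200M), e.g.:
# subprocess.run(mmap_convert_cmd("200m-v1.hal", "200m-v1-mmap.hal"),
#                check=True)
```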
Alternatively, we could add a BED input option to hal2fasta, so that the
sequence data structures would only need to be loaded once; that would
basically fix this particular pain point. But, being realistic about our
bandwidth here, it might take over a week before we get around to it.