For one reason or another, I want to parse the HAL file output by Cactus in Python.
Using h5py, like:
import h5py
f = h5py.File('my.hal', 'r')
print(f['Anc00']['SEQNAME_ARRAY'])
gives:
...
File "h5py/h5t.pyx", line 435, in h5py.h5t.TypeID.dtype.__get__
File "h5py/h5t.pyx", line 951, in h5py.h5t.TypeIntegerID.py_dtype
TypeError: data type '<i15' not understood
Using h5dump to look at the header:
❯ h5dump -H my.hal | grep -A50 "Anc00" | grep -A2 "SEQNAME_ARRAY"
DATASET "SEQNAME_ARRAY" {
DATATYPE 120-bit little-endian integer 8-bit precision
DATASPACE SIMPLE { ( 32 ) / ( 32 ) }
Same as above, but looking at another genome (Pfa):
TypeError: data type '<i14' not understood
and:
❯ h5dump -H my.hal | grep -A50 "Pfa" | grep -A2 "SEQNAME_ARRAY"
DATASET "SEQNAME_ARRAY" {
DATATYPE 112-bit little-endian integer 8-bit precision
DATASPACE SIMPLE { ( 16 ) / ( 16 ) }
I am out of my depth here, but 8 * 15 = 120 and 8 * 14 = 112, so it looks as if the Python library is treating this field as 8 lots of 14- (or 15-)bit integers instead of the other way around.
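To spell out that arithmetic, here is a small sketch that turns the h5dump DATATYPE lines above into a byte size and a dtype-style string. The helper name `describe_h5dump_integer` is made up for illustration, and the way the string is assembled (byte order, then `i`, then the size in bytes) is only my guess at what h5py does, based on the `<i15` and `<i14` in the tracebacks.

```python
import re

# Matches the DATATYPE line that `h5dump -H` printed for these datasets.
_H5_INT = re.compile(
    r"(?P<bits>\d+)-bit (?P<order>little|big)-endian integer"
    r"(?: (?P<precision>\d+)-bit precision)?"
)


def describe_h5dump_integer(line):
    """Summarise an h5dump integer DATATYPE line (hypothetical helper)."""
    match = _H5_INT.search(line)
    if match is None:
        raise ValueError(f"not an integer DATATYPE line: {line!r}")
    total_bits = int(match.group("bits"))
    size_bytes = total_bits // 8  # storage size of one element, in bytes
    # If no precision is printed, assume it equals the full storage size.
    precision = int(match.group("precision") or total_bits)
    order = "<" if match.group("order") == "little" else ">"
    return {
        "size_bytes": size_bytes,
        "precision_bits": precision,
        # Assumption: dtype string = byte order + "i" + size in bytes.
        "dtype_string": f"{order}i{size_bytes}",
    }


anc00 = describe_h5dump_integer(
    "DATATYPE 120-bit little-endian integer 8-bit precision"
)
pfa = describe_h5dump_integer(
    "DATATYPE 112-bit little-endian integer 8-bit precision"
)
print(anc00["dtype_string"], pfa["dtype_string"])
```

If that guess is right, `<i15` would mean a single 15-byte integer element, of which only 8 bits are declared significant, rather than a group of smaller integers.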
Or maybe it's just that there's no appropriate numpy type for this sort of variable length integer?
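For what it's worth, if the raw bytes of the dataset could be obtained some other way (not shown here), splitting them into fixed-width records would be straightforward. This is only a sketch under an assumption I have not verified: that each 15-byte (or 14-byte) element is a NUL-padded ASCII sequence name. The helper `split_fixed_width` and the sample bytes are invented for illustration and do not come from a real HAL file.

```python
def split_fixed_width(raw, width):
    """Split a bytes buffer into width-byte records and strip NUL padding.

    Hypothetical helper: assumes each record is a NUL-padded ASCII string.
    """
    if width <= 0 or len(raw) % width != 0:
        raise ValueError("buffer length must be a multiple of a positive width")
    records = []
    for start in range(0, len(raw), width):
        chunk = raw[start:start + width]
        # Keep only the bytes before the first NUL, then decode them.
        records.append(chunk.split(b"\x00", 1)[0].decode("ascii"))
    return records


# Made-up sample: two 15-byte records, NOT taken from a real HAL file.
sample = b"seqA".ljust(15, b"\x00") + b"seqB_long".ljust(15, b"\x00")
names = split_fixed_width(sample, 15)
print(names)
```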
Thanks very much for your time.