Error processing SEQNAME_ARRAY with h5py

For one reason or another, I want to parse the HAL file output by Cactus in Python.

Using [h5py](https://docs.h5py.org/en/stable/index.html), like:

```py
import h5py

f = h5py.File('my.hal', 'r')
print(f['Anc00']['SEQNAME_ARRAY'])
```

gives:

```sh
...
  File "h5py/h5t.pyx", line 435, in h5py.h5t.TypeID.dtype.__get__
  File "h5py/h5t.pyx", line 951, in h5py.h5t.TypeIntegerID.py_dtype
TypeError: data type '<i15' not understood
```

Looking at the header with `h5dump`:
```sh
❯ h5dump -H my.hal | grep -A50 "Anc00" | grep -A2 "SEQNAME_ARRAY"
      DATASET "SEQNAME_ARRAY" {
         DATATYPE  120-bit little-endian integer 8-bit precision
         DATASPACE  SIMPLE { ( 32 ) / ( 32 ) }
```

Doing the same for another genome (`Pfa`) gives:

```sh
TypeError: data type '<i14' not understood
```

and:
```sh
❯ h5dump -H my.hal | grep -A50 "Pfa" | grep -A2 "SEQNAME_ARRAY"
      DATASET "SEQNAME_ARRAY" {
         DATATYPE  112-bit little-endian integer 8-bit precision
         DATASPACE  SIMPLE { ( 16 ) / ( 16 ) }
```

**I am out of my depth here, but**: `8 * 15 = 120` and `8 * 14 = 112`, so it looks as if the Python library is treating this field as 8 lots of 15-bit (or 14-bit) integers instead of the other way around.
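
To make that arithmetic concrete, here is a small sketch that parses the two `DATATYPE` lines from the `h5dump` output above and derives the dtype string from them. It assumes that the number in `<i15` / `<i14` is the type's total size in bytes and that the integer is signed (hence `i`). Neither is confirmed here, and `H5DUMP_LINES`, `PATTERN` and `describe` are names made up for this illustration, not part of h5py or HAL.

```py
import re

# DATATYPE lines copied from the h5dump output above
H5DUMP_LINES = {
    "Anc00": "DATATYPE  120-bit little-endian integer 8-bit precision",
    "Pfa": "DATATYPE  112-bit little-endian integer 8-bit precision",
}

PATTERN = re.compile(r"(\d+)-bit (little|big)-endian integer (\d+)-bit precision")


def describe(line):
    """Return (size in bytes, precision in bits, NumPy-style dtype string)."""
    m = PATTERN.search(line)
    if m is None:
        raise ValueError(f"unrecognised DATATYPE line: {line!r}")
    total_bits = int(m.group(1))
    precision_bits = int(m.group(3))
    size_bytes = total_bits // 8
    order = "<" if m.group(2) == "little" else ">"
    # Assumption: the type is a signed integer, hence the "i" type code
    return size_bytes, precision_bits, f"{order}i{size_bytes}"


for genome, line in H5DUMP_LINES.items():
    print(genome, describe(line))
```

Under those assumptions the derived strings match the ones in the two error messages, which would suggest the `15` and `14` are byte counts for a single integer rather than a count of narrower integers.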

Or maybe there is simply no appropriate NumPy type for an integer of this non-standard width?

Thanks very much for your time.
