
FormatNXmxEigerFilewriter.get_raw_data fails when number of linked data files is large #695

Open
spmeisburger opened this issue Feb 13, 2024 · 3 comments

@spmeisburger

This bug surfaced when we collected a dataset with a huge number of frames (written out as ~1100 _data_*.h5 files). When we ran dials.find_spots on this dataset, DIALS was extremely slow to start, and it failed with an error:

```
unable to open external link file name = '[...]_data_001013.h5'
```

I think the error is related to our use of NFS for data storage, i.e. too many file handles were open at once. On a dataset with fewer than 1000 _data_*.h5 files, dials.find_spots ran without an error but was extremely slow.

The traceback refers to this line in FormatNXmxEigerFilewriter.get_raw_data:

```python
data_subsets = [v for k, v in sorted(nxdata.items()) if DATA_FILE_RE.match(k)]
```

https://github.com/cctbx/dxtbx/blob/f9013668291ff4bd8d1725178275444d72ac2fd1/src/dxtbx/format/FormatNXmxEigerFilewriter.py#L105C5-L105C83

It looks like every single linked _data_*.h5 file is opened here, on every call to get_raw_data(), which is kind of crazy. However, doing this appears to be unavoidable because the number of images per data file is not stored anywhere else, as far as I can tell.

I made a patch in our local DIALS installation that mostly solves the problem, here: https://github.com/FlexXBeamline/dials-extensions/blob/faster-read-raw/dials_extensions/FormatNXmxEigerFilewriterCHESS.py

The patch only loads the first data file in the series and uses it to determine the data shape. Perhaps something like this could be incorporated into FormatNXmxEigerFilewriter?
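
For illustration, a minimal sketch of that approach (not the actual patch; the helper names are hypothetical, and it assumes the usual Dectris filewriter layout where the master file links /entry/data/data_000001, /entry/data/data_000002, … to the _data_*.h5 files, with every file except possibly the last holding the same number of frames):

```python
import h5py


def frames_per_data_file(master_path: str) -> int:
    """Read the frame count from the first linked data file only."""
    with h5py.File(master_path, "r") as fh:
        # Resolving this single external link opens just one _data_*.h5 file.
        return fh["/entry/data/data_000001"].shape[0]


def locate_frame(index: int, block_size: int) -> tuple[int, int]:
    """Map a global image index to (data file number, index within that file)."""
    return index // block_size + 1, index % block_size
```

With the per-file frame count determined once and cached, get_raw_data(index) only needs to open the single data file that actually contains the requested image.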

@graeme-winter
Collaborator

@spmeisburger thanks for raising this - wasn't aware that this nonsense could happen for every get_raw_data() call: I am sure this is handled more gracefully for non-filewriter data.

In the meantime, you can also set `ulimit -n unlimited`, which should make things work.

@graeme-winter
Collaborator

I suspect what we need to do is actually do the iteration step in the _start() method, and cache the indexing such that the get_raw_data() method can just seek and read. However, I am also fairly sure that there is a good implementation of this elsewhere in dxtbx, driven from C++, which could be preferable 🤔 - certainly I remember this for the old "nearly nexus" and friends format.
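
A rough sketch of that caching idea (a hypothetical helper, not existing dxtbx code, assuming the standard /entry/data/data_NNNNNN external-link layout): walk the linked data files once, record cumulative frame counts, and resolve a global image index with a binary search so that get_raw_data() only opens the one file it needs.

```python
import bisect

import h5py


class DataFileIndex:
    """Built once (e.g. during _start()); maps a global image index to a data file."""

    def __init__(self, master_path: str, data_keys: list[str]):
        # data_keys e.g. ["data_000001", "data_000002", ...], in sorted order.
        self._keys = data_keys
        self._cumulative = []  # cumulative frame count after each data file
        total = 0
        with h5py.File(master_path, "r") as fh:
            nxdata = fh["/entry/data"]
            for key in data_keys:
                total += nxdata[key].shape[0]
                self._cumulative.append(total)

    def locate(self, index: int) -> tuple[str, int]:
        """Return (data file key, offset within that file) for a global index."""
        i = bisect.bisect_right(self._cumulative, index)
        offset = index - (self._cumulative[i - 1] if i else 0)
        return self._keys[i], offset
```

Building the index still touches every data file once, but only once per format object rather than once per get_raw_data() call; and if the per-file frame count can be assumed uniform (as in the patch above), even that single pass could be skipped.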

@biochem-fan
Member

I'm not sure if we have a C++ version. The Python code for NearlyNexus is here:

- `# cope with badly structured chunk information i.e. many more data`
- `class DataFactory:`
