
FormatNXmxEigerFilewriter.get_raw_data fails when number of linked data files is large #695

Open
spmeisburger opened this issue Feb 13, 2024 · 3 comments

@spmeisburger

This bug surfaced when we collected a dataset with a huge number of frames (written out as ~1100 _data_*.h5 files). When we ran dials.find_spots on this dataset, DIALS was extremely slow to start, and it failed with an error:

```
unable to open external link file name = '[...]_data_001013.h5'
```

I think the error is related to our use of NFS for data storage, i.e. too many file handles were open at once. On a dataset with fewer than 1000 _data_*.h5 files, dials.find_spots ran without an error but was extremely slow.

The traceback refers to this line in FormatNXmxEigerFilewriter.get_raw_data:

```python
data_subsets = [v for k, v in sorted(nxdata.items()) if DATA_FILE_RE.match(k)]
```

https://github.com/cctbx/dxtbx/blob/f9013668291ff4bd8d1725178275444d72ac2fd1/src/dxtbx/format/FormatNXmxEigerFilewriter.py#L105C5-L105C83

It looks like every single linked _data_*.h5 file is opened here, on every call to get_raw_data(), which is kind of crazy. However, doing this appears to be unavoidable because the number of images per data file is not stored anywhere else, as far as I can tell.

I made a patch in our local DIALS installation that mostly solves the problem, here: https://github.com/FlexXBeamline/dials-extensions/blob/faster-read-raw/dials_extensions/FormatNXmxEigerFilewriterCHESS.py

The patch only loads the first data file in the series and uses it to determine the data shape. Perhaps something like this could be incorporated into FormatNXmxEigerFilewriter?
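
For illustration, a minimal sketch of that approach (not the actual patch; the helper names are hypothetical, and it assumes the usual Dectris filewriter layout where the master file links /entry/data/data_000001, /entry/data/data_000002, … to the _data_*.h5 files, with every file except possibly the last holding the same number of frames):

```python
import h5py


def frames_per_data_file(master_path: str) -> int:
    """Read the frame count from the first linked data file only."""
    with h5py.File(master_path, "r") as fh:
        # Resolving this single external link opens just one _data_*.h5 file.
        return fh["/entry/data/data_000001"].shape[0]


def locate_frame(index: int, block_size: int) -> tuple[int, int]:
    """Map a global image index to (data file number, index within that file)."""
    return index // block_size + 1, index % block_size
```

With the per-file frame count determined once and cached, get_raw_data(index) only needs to open the single data file that actually contains the requested image.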

@graeme-winter
Collaborator

@spmeisburger thanks for raising this - wasn't aware that this nonsense could happen for every get_raw_data() call: I am sure this is handled more gracefully for non-filewriter data.

In the meantime, you can also set `ulimit -n unlimited`, which should make things work.

@graeme-winter
Collaborator

I suspect what we need to do is actually do the iteration step in the _start() method, and cache the indexing such that the get_raw_data() method can just seek and read. However, I am also fairly sure that there is a good implementation of this elsewhere in dxtbx, driven from C++, which could be preferable 🤔 - certainly I remember this for the old "nearly nexus" and friends format.
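
A rough sketch of that caching idea (a hypothetical helper, not existing dxtbx code, assuming the standard /entry/data/data_NNNNNN external-link layout): walk the linked data files once, record cumulative frame counts, and resolve a global image index with a binary search so that get_raw_data() only opens the one file it needs.

```python
import bisect

import h5py


class DataFileIndex:
    """Built once (e.g. during _start()); maps a global image index to a data file."""

    def __init__(self, master_path: str, data_keys: list[str]):
        # data_keys e.g. ["data_000001", "data_000002", ...], in sorted order.
        self._keys = data_keys
        self._cumulative = []  # cumulative frame count after each data file
        total = 0
        with h5py.File(master_path, "r") as fh:
            nxdata = fh["/entry/data"]
            for key in data_keys:
                total += nxdata[key].shape[0]
                self._cumulative.append(total)

    def locate(self, index: int) -> tuple[str, int]:
        """Return (data file key, offset within that file) for a global index."""
        i = bisect.bisect_right(self._cumulative, index)
        offset = index - (self._cumulative[i - 1] if i else 0)
        return self._keys[i], offset
```

Building the index still touches every data file once, but only once per format object rather than once per get_raw_data() call; and if the per-file frame count can be assumed uniform (as in the patch above), even that single pass could be skipped.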

@biochem-fan
Member

I'm not sure if we have a C++ version. The Python code for NearlyNexus is here:

- `# cope with badly structured chunk information i.e. many more data`
- `class DataFactory:`
