
Bug: No error message when no value calculated for HSE #402

Open · Max1461 opened this issue Mar 22, 2023 · 9 comments
Labels: covariates (modules from the 'features' subpackage), stale (issue not touched from too much time)


Max1461 commented Mar 22, 2023

Describe the bug
When generating graphs from a sample set of PDBs containing micro-environments created from pMHC structures, and saving them to an HDF5 file, the following error occurred for me:

---------------------------------------------------------------------------
RemoteTraceback                           Traceback (most recent call last)
RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/home/max/anaconda3/lib/python3.9/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/home/max/anaconda3/lib/python3.9/multiprocessing/pool.py", line 48, in mapstar
    return list(map(*args))
  File "/home/max/deeprank-core/deeprankcore/query.py", line 197, in _process_one_query
    graph.write_to_hdf5(output_path)
  File "/home/max/deeprank-core/deeprankcore/utils/graph.py", line 220, in write_to_hdf5
    node_features_group.create_dataset(
  File "/home/max/anaconda3/lib/python3.9/site-packages/h5py/_hl/group.py", line 161, in create_dataset
    dsid = dataset.make_new_dset(group, shape, dtype, data, name, **kwds)
  File "/home/max/anaconda3/lib/python3.9/site-packages/h5py/_hl/dataset.py", line 88, in make_new_dset
    tid = h5t.py_create(dtype, logical=1)
  File "h5py/h5t.pyx", line 1663, in h5py.h5t.py_create
  File "h5py/h5t.pyx", line 1687, in h5py.h5t.py_create
  File "h5py/h5t.pyx", line 1747, in h5py.h5t.py_create
TypeError: Object dtype dtype('O') has no native HDF5 equivalent
"""

The above exception was the direct cause of the following exception:

TypeError                                 Traceback (most recent call last)
<ipython-input-8-6150c50ecc2f> in <module>
      4 feature_modules = [importlib.import_module('deeprankcore.features.' + name) for name in feature_names]
      5 # Generate graphs and save them in hdf5 files
----> 6 output_paths = queries.process(output_path, feature_modules = feature_modules)

~/deeprank-core/deeprankcore/query.py in process(self, prefix, feature_modules, cpu_count, combine_output, grid_settings, grid_map_method, grid_augmentation_count)
    271         with Pool(self.cpu_count) as pool:
    272             _log.info('Starting pooling...\n')
--> 273             pool.map(pool_function, self.queries)
    274 
    275         output_paths = glob(f"{prefix}-*.hdf5")

~/anaconda3/lib/python3.9/multiprocessing/pool.py in map(self, func, iterable, chunksize)
    362         in a list that is returned.
    363         '''
--> 364         return self._map_async(func, iterable, mapstar, chunksize).get()
    365 
    366     def starmap(self, func, iterable, chunksize=None):

~/anaconda3/lib/python3.9/multiprocessing/pool.py in get(self, timeout)
    769             return self._value
    770         else:
--> 771             raise self._value
    772 
    773     def _set(self, i, obj):

TypeError: Object dtype dtype('O') has no native HDF5 equivalent

This error message did not make it clear what was actually going wrong. After adding some print statements myself, it turned out the cause lies in the following steps of the graph.py script:

# store node features
node_key_list = list(self._nodes.keys())
first_node_data = list(self._nodes.values())[0].features
node_feature_names = list(first_node_data.keys())
print(node_feature_names)  # debug: list all node feature names
for node_feature_name in node_feature_names:
    print(node_feature_name)  # debug: feature currently being written
    node_feature_data = [node.features[node_feature_name] for node in self._nodes.values()]
    # print(node_feature_data)  # debug: raw values; reveals the None entries for 'hse'
    node_features_group.create_dataset(node_feature_name, data=node_feature_data)

Because no HSE could be calculated for some of the PDB files, the feature value ends up None or empty and cannot be stored, causing a discrepancy in node features between the graphs and resulting in the error shown above.

It would be nice if, whenever the HSE feature (or any other feature) is not or cannot be calculated, a clear error message indicated which feature ran into the problem, so that the user can easily determine the culprit without having to sift through the process themselves.

If you want to reproduce this error, I have an example file of a PDB that works:
1AKJ_1_ILE.txt
And a file that runs into the error:
1AKJ_2_LEU.txt

Running the second file with the following (imports and placeholder output variables added here to make the snippet self-contained; import paths assumed from the traceback above):

import os
import importlib
from deeprankcore.query import QueryCollection, ProteinProteinInterfaceResidueQuery

queries = QueryCollection()

# Append data points
queries.add(ProteinProteinInterfaceResidueQuery(
    pdb_path = "1AKJ_2_LEU.pdb",
    chain_id1 = "A",
    chain_id2 = "C",
    targets = {
        "binary": 0
    }
))

output_directory = "."    # placeholder; any writable directory
project_id = "hse_repro"  # placeholder prefix for the output files
output_path = os.path.join(output_directory, project_id)

# Set features to be used by the feature modules named in feature_names
feature_names = ['components', 'contact', 'exposure', 'surfacearea']
feature_modules = [importlib.import_module('deeprankcore.features.' + name) for name in feature_names]

# Generate graphs and save them in HDF5 files
output_paths = queries.process(output_path, feature_modules = feature_modules)

should reproduce the error. The main "issue" is the lack of an error message from the exposure.py script: no message or error was given to indicate that the problem lay there.

EDITED by @DaniBodor to fix the code blocks.

@Max1461 added the bug label Mar 22, 2023
gcroci2 (Collaborator) commented Mar 30, 2023

We could add something NaN-like in such cases, performing a check before storing both node and edge features. We could also add a warning message in the exposure.py script. @DaniBodor we'll discuss who will pick this up next week

DaniBodor (Collaborator) commented

Maybe we can look for a way to check for NaN/missing values systematically across all features after generating the graph, and output an error with missing features and/or an option to set such values to 0.

gcroci2 (Collaborator) commented Mar 30, 2023

> Maybe we can look for a way to check for NaN/missing values systematically across all features after generating the graph, and output an error with missing features and/or an option to set such values to 0.

There are several opinions about how to set NaN values, and it depends a lot on the feature (e.g. different values for different features), so I wouldn't enforce any default. I would say it's up to the user to decide how to fill them in. Integrating this in the code base while giving great flexibility about which value to fill in for each feature, without breaking anything and doing it properly, is not trivial at all. Also, we need to think about a way that doesn't increase the overhead too much; that's why I would do the check before writing the features to the HDF5 files.

We could also just add a NaN count to each histogram, or create a dict during graph generation and at the end print out how many NaNs are present in each feature (something the user can access and notice). Together with this, we can improve the warnings in the feature modules for such cases.

DaniBodor (Collaborator) commented

Good point about defaulting NaNs.

I still think it would be a good idea to have a default check for NaNs during graph creation (maybe after each feature module is called) and before the HDF5 file is created, so that for future/custom feature modules, if it is not handled within the module itself, there is still a default error message that makes clear what the problem is and where it happened.

@DaniBodor added the covariates label Apr 3, 2023
@DaniBodor self-assigned this Apr 3, 2023
github-actions bot commented May 5, 2023

This issue is stale because it has been open for 30 days with no activity.

github-actions bot added the stale label May 5, 2023
@DaniBodor added the priority label and removed the stale label May 30, 2023
@DaniBodor removed their assignment Jun 29, 2023
github-actions bot commented

This issue is stale because it has been open for 30 days with no activity.

github-actions bot added the stale label Jul 31, 2023
DaniBodor (Collaborator) commented

@gcroci2 , has this been addressed/solved yet?

github-actions bot removed the stale label Sep 20, 2023
gcroci2 (Collaborator) commented Sep 20, 2023

> @gcroci2 , has this been addressed/solved yet?

Nope. We can add a check for the HSE feature, whenever it's computed, and log a warning in case it's empty/None; then we need to default such cases to some HDF5-acceptable value, such as a negative integer (the HSE domain is always positive, right?)

@gcroci2 removed the priority label Nov 1, 2023
@DaniBodor self-assigned this Jul 9, 2024
@DaniBodor removed the bug label Jul 9, 2024
@gcroci2 moved this to To do in Development Jul 12, 2024
github-actions bot commented

This issue is stale because it has been open for 30 days with no activity.

github-actions bot added the stale label Aug 12, 2024
Projects: Development (Status: To do)
No branches or pull requests · 3 participants