
Bug: No error message when no value calculated for HSE #402

Open · Max1461 opened this issue Mar 22, 2023 · 9 comments
Labels: covariates (modules from the 'features' subpackage), stale (issue not touched from too much time)


Max1461 commented Mar 22, 2023

Describe the bug
When generating graphs from a sample set of PDBs containing micro-environments created from pMHC structures, and saving them to an HDF5 file, the following error occurred for me:

---------------------------------------------------------------------------
RemoteTraceback                           Traceback (most recent call last)
RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/home/max/anaconda3/lib/python3.9/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/home/max/anaconda3/lib/python3.9/multiprocessing/pool.py", line 48, in mapstar
    return list(map(*args))
  File "/home/max/deeprank-core/deeprankcore/query.py", line 197, in _process_one_query
    graph.write_to_hdf5(output_path)
  File "/home/max/deeprank-core/deeprankcore/utils/graph.py", line 220, in write_to_hdf5
    node_features_group.create_dataset(
  File "/home/max/anaconda3/lib/python3.9/site-packages/h5py/_hl/group.py", line 161, in create_dataset
    dsid = dataset.make_new_dset(group, shape, dtype, data, name, **kwds)
  File "/home/max/anaconda3/lib/python3.9/site-packages/h5py/_hl/dataset.py", line 88, in make_new_dset
    tid = h5t.py_create(dtype, logical=1)
  File "h5py/h5t.pyx", line 1663, in h5py.h5t.py_create
  File "h5py/h5t.pyx", line 1687, in h5py.h5t.py_create
  File "h5py/h5t.pyx", line 1747, in h5py.h5t.py_create
TypeError: Object dtype dtype('O') has no native HDF5 equivalent
"""

The above exception was the direct cause of the following exception:

TypeError                                 Traceback (most recent call last)
<ipython-input-8-6150c50ecc2f> in <module>
      4 feature_modules = [importlib.import_module('deeprankcore.features.' + name) for name in feature_names]
      5 # Generate graphs and save them in hdf5 files
----> 6 output_paths = queries.process(output_path, feature_modules = feature_modules)

~/deeprank-core/deeprankcore/query.py in process(self, prefix, feature_modules, cpu_count, combine_output, grid_settings, grid_map_method, grid_augmentation_count)
    271         with Pool(self.cpu_count) as pool:
    272             _log.info('Starting pooling...\n')
--> 273             pool.map(pool_function, self.queries)
    274 
    275         output_paths = glob(f"{prefix}-*.hdf5")

~/anaconda3/lib/python3.9/multiprocessing/pool.py in map(self, func, iterable, chunksize)
    362         in a list that is returned.
    363         '''
--> 364         return self._map_async(func, iterable, mapstar, chunksize).get()
    365 
    366     def starmap(self, func, iterable, chunksize=None):

~/anaconda3/lib/python3.9/multiprocessing/pool.py in get(self, timeout)
    769             return self._value
    770         else:
--> 771             raise self._value
    772 
    773     def _set(self, i, obj):

TypeError: Object dtype dtype('O') has no native HDF5 equivalent

This error message did not make it clear what was actually going wrong. After adding some print statements myself, it turned out the cause lies in the following steps of the graph.py script:

# store node features
node_key_list = list(self._nodes.keys())
first_node_data = list(self._nodes.values())[0].features
node_feature_names = list(first_node_data.keys())
print(node_feature_names)  # debug: list all node feature names
for node_feature_name in node_feature_names:
    print(node_feature_name)  # debug: feature currently being written
    node_feature_data = [node.features[node_feature_name] for node in self._nodes.values()]
    # print(node_feature_data)  # debug: raw values; reveals the None entries for 'hse'
    node_features_group.create_dataset(node_feature_name, data=node_feature_data)

Because no HSE could be calculated for some of the PDB files, the feature value ends up None or empty and cannot be stored, causing a discrepancy in node features between the graphs and resulting in the error shown above.

It would be nice if, whenever the HSE feature (or any other feature) is not or cannot be calculated, a clear error message indicated which feature ran into the problem, so that the user can easily determine the culprit without having to sift through the process themselves.

If you want to reproduce this error, I have an example file of a PDB that works:
1AKJ_1_ILE.txt
And a file that runs into the error:
1AKJ_2_LEU.txt

Running the second file with the following (imports and placeholder output variables added here to make the snippet self-contained; import paths assumed from the traceback above):

import os
import importlib
from deeprankcore.query import QueryCollection, ProteinProteinInterfaceResidueQuery

queries = QueryCollection()

# Append data points
queries.add(ProteinProteinInterfaceResidueQuery(
    pdb_path = "1AKJ_2_LEU.pdb",
    chain_id1 = "A",
    chain_id2 = "C",
    targets = {
        "binary": 0
    }
))

output_directory = "."    # placeholder; any writable directory
project_id = "hse_repro"  # placeholder prefix for the output files
output_path = os.path.join(output_directory, project_id)

# Set features to be used by the feature modules named in feature_names
feature_names = ['components', 'contact', 'exposure', 'surfacearea']
feature_modules = [importlib.import_module('deeprankcore.features.' + name) for name in feature_names]

# Generate graphs and save them in HDF5 files
output_paths = queries.process(output_path, feature_modules = feature_modules)

should reproduce the error. The main "issue" is the lack of an error message from the exposure.py script: no message or error was given to indicate that the problem lay there.

EDITED by @DaniBodor to fix the code blocks.

@Max1461 added the bug label Mar 22, 2023
gcroci2 (Collaborator) commented Mar 30, 2023

We could add something NaN-like in such cases, performing a check before storing both node and edge features. We could also add a warning message in the exposure.py script. @DaniBodor we'll discuss who will pick this up next week

DaniBodor (Collaborator) commented

Maybe we can look for a way to check for NaN/missing values systematically across all features after generating the graph, and output an error with missing features and/or an option to set such values to 0.

gcroci2 (Collaborator) commented Mar 30, 2023

> Maybe we can look for a way to check for NaN/missing values systematically across all features after generating the graph, and output an error with missing features and/or an option to set such values to 0.

There are several opinions about how to set NaN values, and it depends a lot on the feature (e.g. different values for different features), so I wouldn't enforce any default. I would say it's up to the user to decide how to fill them in. Integrating this in the code base while giving great flexibility about which value to fill in for each feature, without breaking anything and doing it properly, is not trivial at all. Also, we need to think about a way that doesn't increase the overhead too much; that's why I would do the check before writing the features to the HDF5 files.

We could also just add a NaN count to each histogram, or create a dict during graph generation and at the end print out how many NaNs are present in each feature (something the user can access and notice). Together with this, we can improve the warnings in the feature modules for such cases.

DaniBodor (Collaborator) commented

Good point about defaulting NaNs.

I still think it would be a good idea to have a default check for NaNs during graph creation (maybe after each feature module is called) and before the HDF5 file is created, so that for future/custom feature modules, if it is not handled within the module itself, there is still a default error message that makes clear what the problem is and where it happened.

@DaniBodor added the covariates label Apr 3, 2023
@DaniBodor self-assigned this Apr 3, 2023
github-actions bot commented May 5, 2023

This issue is stale because it has been open for 30 days with no activity.

github-actions bot added the stale label May 5, 2023
@DaniBodor added the priority label and removed the stale label May 30, 2023
@DaniBodor removed their assignment Jun 29, 2023
github-actions bot commented

This issue is stale because it has been open for 30 days with no activity.

github-actions bot added the stale label Jul 31, 2023
DaniBodor (Collaborator) commented

@gcroci2 , has this been addressed/solved yet?

github-actions bot removed the stale label Sep 20, 2023
gcroci2 (Collaborator) commented Sep 20, 2023

> @gcroci2 , has this been addressed/solved yet?

Nope. We can add a check for the HSE feature, whenever it's computed, and log a warning in case it's empty/None; then we need to default such cases to some HDF5-acceptable value, such as a negative integer (the HSE domain is always positive, right?)

@gcroci2 removed the priority label Nov 1, 2023
@DaniBodor self-assigned this Jul 9, 2024
@DaniBodor removed the bug label Jul 9, 2024
@gcroci2 moved this to To do in Development Jul 12, 2024
github-actions bot commented

This issue is stale because it has been open for 30 days with no activity.

github-actions bot added the stale label Aug 12, 2024
Projects: Development (Status: To do)
No branches or pull requests · 3 participants