Issue64 #66 (Open)

Wants to merge 3 commits into master.
Conversation

@bnlawrence (Author)

This is the simple fix described in #64. It includes a test that exposes the bug and the one-liner that fixes it, and it works for a number of other data files. The commits now include the correct issue number.

assert 'data' in hfile
d = hfile['data']
# the point of this test is to ensure we can actually retrieve the compression opts
x = d.compression_opts

As an enhancement to the unit test, could you also check that the value of x is correct?
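
For example, a minimal sketch of such a check (the expected value here is an assumption; it should match whatever gzip level the test file was actually written with):

assert 'data' in hfile
d = hfile['data']
x = d.compression_opts
# Check the value itself, not just that the attribute is retrievable.
assert x == 4  # hypothetical: the gzip level used when creating the test file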

Comment on lines 323 to 326

     if d['filter_id'] == GZIP_DEFLATE_FILTER][0]
-    return gzip_entry['client_data'][0]
+    key = {0:'client_data_values',1:'client_data'}['client_data' in gzip_entry]
+    return gzip_entry[key][0]
     return None

According to the documentation pointed to in your issue, [link], it seems there is only a single list of client data values. Instead of your suggested fix, I think we should search-and-replace in the code so that we never produce a dictionary with "client_data" as the key. (Disclaimer: I haven't actually studied the source code yet.)

@bnlawrence (Author)

I can see arguments for doing it both ways. The code I've committed is as faithful as possible to what is in the file at the point of use; doing it differently would require hacking the message structure before using it. I'd defer to the package owners on which option is best.


That's fine, thanks for elaborating!

@bnlawrence (Author)

(At some point you'd have to do what I've got above: insofar as files exist with both compression option structures, we can't control what we will be given to read.)

@bmaranville (Collaborator)

I think there is a bug in dataobjects.py, when reading the filter pipeline.

For version 1 pipeline descriptions, filter_info["client_data_values"] is read directly from the message and corresponds to the length of the client data array. The values are then read into filter_info["client_data"].

For version 2 pipelines, the values in the client data array are stored in filter_info["client_data_values"], and a different variable (num_client_values) is used to temporarily store the length of the array.

This is incorrect: the values should be stored in filter_info["client_data"], as they are for v1, and for consistency we could also store the length of the value array in filter_info["client_data_values"] (though that value would probably not be used again).

If this is fixed then the changes to dataobjects.py above are not needed.
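
For illustration, a sketch of the entry shape both pipeline versions would then produce for a gzip filter (the concrete values are assumptions, not read from a real file):

# Assumed example entry after the proposed fix; GZIP_DEFLATE_FILTER is the
# pyfive constant for the gzip/deflate filter (filter id 1 in the HDF5 format).
filter_info = {
    'filter_id': GZIP_DEFLATE_FILTER,
    'client_data': (4,),        # the client values themselves (here, the gzip level)
    'client_data_values': 1,    # length of client_data, kept for v1/v2 consistency
}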

@bmaranville self-requested a review on January 7, 2025 at 14:38
@bmaranville (Collaborator) left a comment

I recommend fixing the section of dataobjects.py that reads the client data from the file into a dict, rather than trying to deal with the broken dict after the fact.

@bmaranville (Collaborator)

Ha... looking back through the history, it looks like it might be me who committed the code that I'm suggesting be fixed here.

@bmaranville (Collaborator)

diff --git a/pyfive/dataobjects.py b/pyfive/dataobjects.py
index adb19eb..5cdc169 100644
--- a/pyfive/dataobjects.py
+++ b/pyfive/dataobjects.py
@@ -409,7 +409,8 @@ class DataObjects(object):
                 filter_info['name'] = name
                 client_values = struct.unpack_from("<{:d}i".format(num_client_values), self.msg_data, offset)
                 offset += (4 * num_client_values)
-                filter_info['client_data_values'] = client_values
+                filter_info['client_data'] = client_values
+                filter_info['client_data_values'] = num_client_values
 
                 filters.append(filter_info)
         else:

@bmaranville (Collaborator)

Your new test passes after the diff above (without the other change in ca1b6d6)

@kalvdans commented Jan 7, 2025

+                filter_info['client_data'] = client_values
+                filter_info['client_data_values'] = num_client_values

Storing the number of values in a separate key is redundant, as it could be extracted with len(filter_info['client_data']).

@bmaranville (Collaborator)

Oh, I agree it's redundant; I was just trying to keep the dict consistent between v1 and v2 pipelines. (In v1 you have to have that key because it's part of the struct message parsed directly from disk, while in v2 we're creating the dict manually.)
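
For context, a minimal hypothetical sketch of why the key is unavoidable in v1: the count is a field of the on-disk struct (the HDF5 v1 filter description is filter id, name length, flags, then the number of client data values, each two bytes). This is an illustration, not pyfive's actual code:

import struct

def parse_v1_filter_header(buf, offset):
    # All four fields are unsigned 16-bit little-endian integers, so the
    # count of client data values arrives as part of the parsed struct.
    filter_id, name_length, flags, num_values = struct.unpack_from('<4H', buf, offset)
    return {
        'filter_id': filter_id,
        'name_length': name_length,
        'flags': flags,
        'client_data_values': num_values,  # read straight from disk
    }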

@bnlawrence (Author)

Yes, it works for me too! What's the best way forward? Do you want me to fix this in my pull request and go from there?
