Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding post_indexing_validation module #246

Merged
merged 10 commits into from
Dec 23, 2024

Conversation

jacob50231
Copy link
Contributor

Link to JIRA ticket if there is one:

New Features

Adds a module for post-indexing validation

Copy link

The style in this PR agrees with black. ✔️

This formatting comment was generated automatically by a script in uc-cdis/wool.

Copy link
Contributor

@BedfordWest BedfordWest left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few thoughts/requests. I don't necessarily think all of these need to be addressed - we can maybe chat about it when we have a free cycle or eng workshop.

try:
resp = gen3file.get_presigned_url(self.guid)
url = resp.get("url")
response_status = 200
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a little strange to me - is there a reason we're manually setting the status code instead of logging what we actually got?

Copy link
Contributor Author

@jacob50231 jacob50231 Dec 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So instead of returning a requests.Response object Gen3File.get_presigned_url returns the json from that response which does not contain the status code. I'm only able to obtain a status code directly when raise_for_status_and_print_error raises an HTTPError from within get_presigned_url, but since that raises an error any time the status is anything other than 200, if it doesn't raise one I know the status is 200 so I set it manually. (I know it feels hacky, but it was the only way to do it without also updating the get_presigned_url function which would almost certainly cause breaking changes somewhere)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmmm. I think this is OK for now, but not ideal. For future reference, I'd probably just generalize the variable to something like "status_ok" as a bool and then in the log state we got a successful response. It might be the case that this always returns 200 for success now, but there are a bunch of types of successful HTTP responses and it just feels a bit too manual to me!

But I think this is OK for now and no need to change. Thanks!

)
except:
download_success = -1
self.download_status = download_success
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this line be inside the if block? I imagine we wouldn't want to set download_status to download_success if we weren't inside that block, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was a bad name for this variable. I'm updating the script to not use the variable at all and instead directly change self.download_status

return


class Records:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having a "Record" class and "Records" class feels a bit like an anti-pattern to me. It might just be a naming issue, though. Docstrings for the classes explaining their intent might be helpful. If this class is meant more as a utility class to parse input into multiple records (which I think is the case after scanning over it a bit more), a more appropriate name/noun might be something like "RecordCreator" or "RecordParser".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah it's the latter. Renaming the class to make this more clear.

Args:
manifest (str): the location of a manifest file
"""
if manifest[-3:] == "tsv":
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might want to validate the length of the manifest before doing this just in case it's empty or some such and then throw a more helpful error rather than the default invalid index you'd get.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will update the validation more here good catch!

for acl in row["acl"].split(" "):
if acl != "admin":
for url in row["url"].split(" "):
if "://" not in url:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason we continue here rather than throwing an error? This would be an invalid url format, no?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to throw an error

url.split("/")[2],
)
key = (bucket, protocol, acl)
if key not in self.record_dict or (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic is getting a little funky here. It may be accurate, but a comment or docstring for the method might be useful to make it a little clearer. Possibly even worth splitting this out a little given how deeply nested it is. Maybe worth separating out url parsing for this purpose into its own utility? Just some thoughts.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Set some comments to make it clearer. If you still want me to separate this out I can.

url_parsed = True

if url_parsed == False:
logger.warning(f"No url parsed for record {guid}")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels like it might be more appropriate as an error, no? If we want to ensure everything gets validated and we don't exit early, we could store a list of anything that errored out and then throw an error at the end and provide the list.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

}
)

self.download_results = download_results
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a reason we don't just use self.download_results from the start?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should've been changed earlier when I added download_results as a class attribute. Good catch!

Copy link
Contributor

@BedfordWest BedfordWest left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is mergeable now. Thanks for all the great work on this!

@jacob50231 jacob50231 merged commit 5bbe468 into master Dec 23, 2024
8 checks passed
@jacob50231 jacob50231 deleted the feat/add-post-indexing-validation branch December 23, 2024 19:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants