-
Notifications
You must be signed in to change notification settings - Fork 23
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding post_indexing_validation module #246
Conversation
The style in this PR agrees with This formatting comment was generated automatically by a script in uc-cdis/wool. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few thoughts/requests. I don't necessarily think all of these need to be addressed - we can maybe chat about it when we have a free cycle or eng workshop.
try: | ||
resp = gen3file.get_presigned_url(self.guid) | ||
url = resp.get("url") | ||
response_status = 200 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a little strange to me - is there a reason we're manually setting the status code instead of logging what we actually got?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So instead of returning a requests.Response object Gen3File.get_presigned_url returns the json from that response which does not contain the status code. I'm only able to obtain a status code directly when raise_for_status_and_print_error raises an HTTPError from within get_presigned_url, but since that raises an error any time the status is anything other than 200, if it doesn't raise one I know the status is 200 so I set it manually. (I know it feels hacky, but it was the only way to do it without also updating the get_presigned_url function which would almost certainly cause breaking changes somewhere)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmmm. I think this is OK for now, but not ideal. For future reference, I'd probably just generalize the variable to something like "status_ok" as a bool and then in the log state we got a successful response. It might be the case that this always returns 200 for success now, but there are a bunch of types of successful HTTP responses and it just feels a bit too manual to me!
But I think this is OK for now and no need to change. Thanks!
) | ||
except: | ||
download_success = -1 | ||
self.download_status = download_success |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this line be inside the if block? I imagine we wouldn't want to set download_status to download_success if we weren't inside that block, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was a bad name for this variable. I'm updating the script to not use the variable at all and instead directly change self.download_status
return | ||
|
||
|
||
class Records: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having a "Record" class and "Records" class feels a bit like an anti-pattern to me. It might just be a naming issue, though. Docstrings for the classes explaining their intent might be helpful. If this class is meant more as a utility class to parse input into multiple records (which I think is the case after scanning over it a bit more), a more appropriate name/noun might be something like "RecordCreator" or "RecordParser".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah it's the latter. Renaming the class to make this more clear.
Args: | ||
manifest (str): the location of a manifest file | ||
""" | ||
if manifest[-3:] == "tsv": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might want to validate the length of the manifest before doing this just in case it's empty or some such and then throw a more helpful error rather than the default invalid index you'd get.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will update the validation more here good catch!
for acl in row["acl"].split(" "): | ||
if acl != "admin": | ||
for url in row["url"].split(" "): | ||
if "://" not in url: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason we continue here rather than throwing an error? This would be an invalid url format, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to throw an error
url.split("/")[2], | ||
) | ||
key = (bucket, protocol, acl) | ||
if key not in self.record_dict or ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The logic is getting a little funky here. It may be accurate, but a comment or docstring for the method might be useful to make it a little clearer. Possibly even worth splitting this out a little given how deeply nested it is. Maybe worth separating out url parsing for this purpose into its own utility? Just some thoughts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Set some comments to make it clearer. If you still want me to separate this out I can.
url_parsed = True | ||
|
||
if url_parsed == False: | ||
logger.warning(f"No url parsed for record {guid}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This feels like it might be more appropriate as an error, no? If we want to ensure everything gets validated and we don't exit early, we could store a list of anything that errored out and then throw an error at the end and provide the list.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated
} | ||
) | ||
|
||
self.download_results = download_results |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there a reason we don't just use self.download_results from the start?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should've been changed earlier when I added download_results as a class attribute. Good catch!
…s/gen3sdk-python into feat/add-post-indexing-validation
…s/gen3sdk-python into feat/add-post-indexing-validation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is mergeable now. Thanks for all the great work on this!
Link to JIRA ticket if there is one:
New Features
Adds a module for post-indexing validation