Adding post_indexing_validation module #246

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Sign up for GitHub

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jump to bottom

Merged

jacob50231 merged 10 commits into master from feat/add-post-indexing-validation

Dec 23, 2024

Contributor

jacob50231 commented Dec 17, 2024

Link to JIRA ticket if there is one:

New Features

Adds a module for post-indexing validation

jacob50231 and others added 3 commits

December 17, 2024 15:22


          Add post_indexing_validation module

096ca82


          Add tests

0dcb2ca


          Apply automatic documentation changes

f7230aa

github-actions bot commented Dec 17, 2024

The style in this PR agrees with black. ✔️

This formatting comment was generated automatically by a script in uc-cdis/wool.


          Apply automatic documentation changes

159989b

BedfordWest suggested changes

View reviewed changes

Contributor

BedfordWest left a comment

A few thoughts/requests. I don't necessarily think all of these need to be addressed - we can maybe chat about it when we have a free cycle or eng workshop.

gen3/tools/indexing/post_indexing_validation.py

+                      try:
+                          resp = gen3file.get_presigned_url(self.guid)
+                          url = resp.get("url")
+                          response_status = 200

Contributor

BedfordWest Dec 19, 2024

This is a little strange to me - is there a reason we're manually setting the status code instead of logging what we actually got?

Contributor Author

jacob50231 Dec 19, 2024 •

edited

Loading

So instead of returning a requests.Response object Gen3File.get_presigned_url returns the json from that response which does not contain the status code. I'm only able to obtain a status code directly when raise_for_status_and_print_error raises an HTTPError from within get_presigned_url, but since that raises an error any time the status is anything other than 200, if it doesn't raise one I know the status is 200 so I set it manually. (I know it feels hacky, but it was the only way to do it without also updating the get_presigned_url function which would almost certainly cause breaking changes somewhere)

Contributor

BedfordWest Dec 23, 2024

Hmmmm. I think this is OK for now, but not ideal. For future reference, I'd probably just generalize the variable to something like "status_ok" as a bool and then in the log state we got a successful response. It might be the case that this always returns 200 for success now, but there are a bunch of types of successful HTTP responses and it just feels a bit too manual to me!

But I think this is OK for now and no need to change. Thanks!

gen3/tools/indexing/post_indexing_validation.py Outdated

+                              )
+                          except:
+                              download_success = -1
+                      self.download_status = download_success

Contributor

BedfordWest Dec 19, 2024

Shouldn't this line be inside the if block? I imagine we wouldn't want to set download_status to download_success if we weren't inside that block, right?

Contributor Author

jacob50231 Dec 19, 2024

This was a bad name for this variable. I'm updating the script to not use the variable at all and instead directly change self.download_status

gen3/tools/indexing/post_indexing_validation.py Outdated

		return


		class Records:

Contributor

BedfordWest Dec 19, 2024

Having a "Record" class and "Records" class feels a bit like an anti-pattern to me. It might just be a naming issue, though. Docstrings for the classes explaining their intent might be helpful. If this class is meant more as a utility class to parse input into multiple records (which I think is the case after scanning over it a bit more), a more appropriate name/noun might be something like "RecordCreator" or "RecordParser".

Contributor Author

jacob50231 Dec 19, 2024

Yeah it's the latter. Renaming the class to make this more clear.

gen3/tools/indexing/post_indexing_validation.py Outdated

+                      Args:
+                          manifest (str): the location of a manifest file
+                      """
+                      if manifest[-3:] == "tsv":

Contributor

BedfordWest Dec 19, 2024

Might want to validate the length of the manifest before doing this just in case it's empty or some such and then throw a more helpful error rather than the default invalid index you'd get.

Contributor Author

jacob50231 Dec 19, 2024

Will update the validation more here good catch!

gen3/tools/indexing/post_indexing_validation.py Outdated

+                              for acl in row["acl"].split(" "):
+                                  if acl != "admin":
+                                      for url in row["url"].split(" "):
+                                          if "://" not in url:

Contributor

BedfordWest Dec 19, 2024

Is there a reason we continue here rather than throwing an error? This would be an invalid url format, no?

Contributor Author

jacob50231 Dec 19, 2024

Changed to throw an error

gen3/tools/indexing/post_indexing_validation.py Outdated

+                                                  url.split("/")[2],
+                                              )
+                                              key = (bucket, protocol, acl)
+                                              if key not in self.record_dict or (

Contributor

BedfordWest Dec 19, 2024

The logic is getting a little funky here. It may be accurate, but a comment or docstring for the method might be useful to make it a little clearer. Possibly even worth splitting this out a little given how deeply nested it is. Maybe worth separating out url parsing for this purpose into its own utility? Just some thoughts.

Contributor Author

jacob50231 Dec 19, 2024

Set some comments to make it clearer. If you still want me to separate this out I can.

gen3/tools/indexing/post_indexing_validation.py Outdated

+                                              url_parsed = True
+                              if url_parsed == False:
+                                  logger.warning(f"No url parsed for record {guid}")

Contributor

BedfordWest Dec 19, 2024

This feels like it might be more appropriate as an error, no? If we want to ensure everything gets validated and we don't exit early, we could store a list of anything that errored out and then throw an error at the end and provide the list.

Contributor Author

jacob50231 Dec 19, 2024

Updated

gen3/tools/indexing/post_indexing_validation.py Outdated

+                              }
+                          )
+                      self.download_results = download_results

Contributor

BedfordWest Dec 19, 2024

is there a reason we don't just use self.download_results from the start?

Contributor Author

jacob50231 Dec 19, 2024

This should've been changed earlier when I added download_results as a class attribute. Good catch!

jacob50231 and others added 6 commits

December 19, 2024 15:43


          make requested changes

2f2b213


          Merge branch 'feat/add-post-indexing-validation' of github.com:uc-cdi…

8ff85d1

…s/gen3sdk-python into feat/add-post-indexing-validation


          Apply automatic documentation changes

792d152


          One last change

3f8c02d


          Merge branch 'feat/add-post-indexing-validation' of github.com:uc-cdi…

b9033ea

…s/gen3sdk-python into feat/add-post-indexing-validation


          Apply automatic documentation changes

d987c7f

BedfordWest approved these changes

View reviewed changes

Contributor

BedfordWest left a comment

I think this is mergeable now. Thanks for all the great work on this!

jacob50231 merged commit 5bbe468 into master

8 checks passed

jacob50231 deleted the feat/add-post-indexing-validation branch

December 23, 2024 19:20

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet