Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speed up mast.Observations.get_cloud_uris() #2145

Closed
wants to merge 11 commits into from

Conversation

barentsen
Copy link
Contributor

@barentsen barentsen commented Sep 8, 2021

When using astroquery.mast, it is currently very slow to download multiple data products in cloud-enabled mode.

The main reason for the poor performance is that the Observations.get_cloud_uris() helper method is very slow at retrieving AWS S3 URIs. For example, the following example requires >2 minutes to obtain URIs for 186 data products:

Screen Shot 2021-09-07 at 5 44 33 PM

This is unfortunate because slow data access makes the cloud-enabled mode less attractive.

The main reason Observations.get_cloud_uris() is slow is because every data products triggers a separate HTTP request to MAST's path_lookup API via the astroquery.mast.utils.mast_relative_path() helper function.

This PR proposes to modify these functions such that MAST's path_lookup API is called in chunks of 50 products at once, allowing Observations.get_cloud_uris() to execute much faster. For example, the demo above is ~7x faster when using this PR branch:

Screen Shot 2021-09-07 at 6 02 26 PM

(Note: the remaining 20s used by get_cloud_uris() is spent polling every cloud URI to verify that it exists before returning it to the user, because not every MAST product is mirrored in the cloud. This will be harder to speed up, but fortunately this may be fast when running in the cloud.)

@barentsen barentsen changed the title Speed up mast.Observations.get_cloud_uris by obtaining multiple URIs at once Speed up mast.Observations.get_cloud_uris() Sep 8, 2021
@codecov
Copy link

codecov bot commented Sep 8, 2021

Codecov Report

Attention: Patch coverage is 2.27273% with 43 lines in your changes missing coverage. Please review.

Project coverage is 66.44%. Comparing base (aff2ac5) to head (4c53a19).
Report is 839 commits behind head on main.

Files with missing lines Patch % Lines
astroquery/mast/cloud.py 0.00% 26 Missing ⚠️
astroquery/mast/utils.py 5.55% 17 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2145      +/-   ##
==========================================
+ Coverage   66.10%   66.44%   +0.34%     
==========================================
  Files         234      235       +1     
  Lines       18065    18123      +58     
==========================================
+ Hits        11941    12042     +101     
+ Misses       6124     6081      -43     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@bsipocz
Copy link
Member

bsipocz commented Feb 24, 2022

This now needs and rebase. Also, @jaymedina could you assign it to the person who are supposed to review this, or cc them on the PR? Thanks!

@bsipocz bsipocz removed the request for review from ceb8 February 24, 2022 04:12
@jaymedina
Copy link
Contributor

jaymedina commented Feb 24, 2022

Hi @bsipocz I'll review this in conjunction with @tomdonaldson (I think he started a review already?)

Edit:
I've started a review for this PR @barentsen. This looks really great, and the improvement in time is fantastic. This will need a rebase and re-testing before anything else tho, because I'm pretty sure get_cloud_uri_list will break for GALEX data retrieval from the cloud. GALEX data retrieval requires a minor tweak in the output of mast_relative_path because of the location it's been uploaded onto S3. I've made a review comment below that goes into more detail.

@jaymedina jaymedina self-requested a review February 24, 2022 19:56
Copy link
Contributor

@jaymedina jaymedina left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks solid, the main suggestion I have is what I mentioned about the GALEX relative path tweak that I discovered was necessary for GALEX on S3. Also added some minor comments, but in general this looks good.

I'd say the first thing to do right now is rebase and rerun unit tests, to verify that GALEX indeed breaks without a tweak to the mast_relative_path output.


return [self.get_cloud_uri(product, include_bucket, full_url) for product in data_products]
paths = utils.mast_relative_path(data_products["dataURI"])
Copy link
Contributor

@jaymedina jaymedina Mar 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will break for GALEX data retrieval from the cloud, unfortunately. See here for how I tweaked this output in a recent PR following the new GALEX availability on the cloud:

if 'galex' in path:

Your two options would be to either:

  1. Make a tweak similar to this in get_cloud_uri_list
  2. Make the tweak in mast_relative_path - this is probably the cleanest option because then we can remove my tweak from cloud.py.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

re this suggestion: in the cloud WG we're working on generalizing these methods and move them out from mast. So a tweak at the place of usage, or a generically usable kwarg would be preferable rather than a hard-wired conditional on mast specific strings.

Copy link
Contributor

@jaymedina jaymedina Jul 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I did was move the path.strip() logic out of cloud.py and into util.py's mast_relative_path function. In other words, I essentially went with option 2. listed above. And because this is a MAST-specific function in utils, it makes sense for MAST-specific tweaks to take place in there. Let me know if that works for you!

astroquery/mast/cloud.py Show resolved Hide resolved
@bsipocz bsipocz added this to the v0.4.7 milestone Jul 26, 2022
@bsipocz
Copy link
Member

bsipocz commented Sep 8, 2022

@barentsen @jaymedina - what are the plans with this PR?

@bsipocz
Copy link
Member

bsipocz commented Nov 9, 2022

Ping @barentsen @jaymedina @tomdonaldson, do you have any plans for finishing up this PR? I expect it's reasonably close for being ready.

@bsipocz
Copy link
Member

bsipocz commented Jul 6, 2023

ping @jaymedina!

@jaymedina
Copy link
Contributor

I'll take a look at this and make updates accordingly.

@jaymedina
Copy link
Contributor

I've rebased, adapted the code to account for GALEX cloud products, updated error handling and unit tests, moved the MAST-specific string manipulation out of cloud.py, and squashed the commits into 4. I think this is ready for another review @bsipocz , thanks in advance.

Copy link
Member

@bsipocz bsipocz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for picking this up and pushing it through the finish line.

I have some minor comments, mostly asking for clarifications.

path = path.lstrip("/mast/")
# Making sure we got at least 1 URI from the query above.
if len(uri_list) == 0:
raise InvalidQueryError("No products found on the cloud for this query.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this an InvalidQueryError or rather a NoResultsWarning? We haven't previously raised the error, but if you think the query has an error in it, rather than just there is no result, then we can raise it from now on (probably would require its own changelog entry)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I chose to raise an error rather than a warning so that the output error message is more helpful. Here is what it looks like when I use a NoResultsWarning:

image

The final TypeError is an indirect result of the empty uri list, so it's a vague error message.

Raising a custom InvalidQueryError makes things more clear:

image

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe then a conditional is needed to return the empty result early, before going into the Typeerror. The point is basically to use the InvalidqueryError when there is an issue with the user input. Not having data, imo is not in this category.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On second thought, a user will realize that their call to Observations.get_cloud_uri(product) returns an empty list before calling Observations.download_file, since they have to feed the uri in string form and by default the uri gets returned as a list even when it's 1 URI. So this error handling thing might be redundant.


warnings.warn("Unable to locate file {}.".format(data_product['productFilename']), NoResultsWarning)
return None
return uri_list[0]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why only the first element, what is expected to be included in the rest?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The output for get_cloud_uri_list is always a list, even when the output returns only 1 URI, so the sole element gets indexed out before being returned. I just committed some commentary above this line for clarity.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's indeed a good way to handle the one element lists. I wonder what happens when there are more than one, would the rest just be ignored?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There wouldn't be a case where a request for a single cloud URI would return more than one because of L171 in util.py of this branch, but I can raise an AssertionError here to ensure that there is only 1 element in the list, what do you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm really going into bikeshedding now, but if that's the case, why does it have to be a list to begin with?

Also, L173 is never reached?

Maybe just a comment about this would be enough, no need for the assertion.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, just revisiting this branch in a while, what assertion are you referring to? I added a comment above L118 in this branch explaining why the zero-indexing has to happen. It's because uri_list is declared as a list in get_cloud_uri_list which get_cloud_uri wraps around. Everything is wrapped around get_cloud_uri_list because that method calls utils.mast_relative_path which has the new chunking functionality that Geert implemented, which speeds up the call by making 1 request for multiple URIs rather than 1 request per URI.

# Use `head_object` to verify that the product is available on S3 (not all products are)
s3_client.head_object(Bucket=self.pubdata_bucket, Key=path)
if include_bucket:
path = "s3://{}/{}".format(self.pubdata_bucket, path)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't necessarily like the overwriting of variables here as when this has to be debugged it's a bit more difficult to distinguish between the path from the iteration vs the path here. Maybe use s3_path?

{"uri": chunk})
json_response = response.json()
for uri in chunk:
path = json_response.get(uri[1])["path"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see what's expected to be in uri here, could you explain it a bit, so I see why the second element?

@@ -405,28 +405,32 @@ def test_get_cloud_uri(self):

def test_get_cloud_uris(self):
pytest.importorskip("boto3")
test_obs_id = '25568122'
test_obs_ids = ["25568122", "31411"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do this with pytest.mark.parametrize instead of an internal loop

@bsipocz bsipocz closed this Nov 9, 2023
@bsipocz bsipocz reopened this Nov 9, 2023
@pep8speaks
Copy link

Hello @barentsen! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 115:24: E711 comparison to None should be 'if cond is None:'

@bsipocz
Copy link
Member

bsipocz commented Nov 9, 2023

Close/reopen to retrigger CI (as logs were long gone now)

@bsipocz
Copy link
Member

bsipocz commented Nov 9, 2023

Needs a rebase for RTD fix and some PEP8 fixes before final review.

@jaymedina
Copy link
Contributor

jaymedina commented Nov 28, 2023

Hi @bsipocz and @keflavich, I'm tagging the interim astroquery.mast maintainer, Julie Imig (@astrojimig). She will be taking over remaining PRs and will be the liaison for MAST-related issues after I'm gone on December 1st.

@bsipocz
Copy link
Member

bsipocz commented Dec 8, 2023

@astrojimig - I suppose the conflict is coming from the changes from your recently merged panstarss branch. Anyway, it would be nice to finally wrap this PR up and merge it.

Could you try and rebase this one? I'm not sure you inherited the access to Geert's branch from Jenny.

@astrojimig
Copy link

@bsipocz Sure, I'll try to work on this next week!

@bsipocz bsipocz removed this from the v0.4.7 milestone Feb 17, 2024
@snbianco
Copy link
Contributor

@bsipocz Working on this now! This might just be my own misunderstanding of GitHub, but do I need some sort of permissions to push to this branch? Whenever I try, I get a permission denied error. Is there an alternate method I should be using?

@bsipocz
Copy link
Member

bsipocz commented Jul 12, 2024

@bsipocz Working on this now! This might just be my own misunderstanding of GitHub, but do I need some sort of permissions to push to this branch? Whenever I try, I get a permission denied error. Is there an alternate method I should be using?

You cannot push to someone else's fork by default, so you will need to take the applicable changes from here. I have reviewed this a couple of times, but things also have changed in the module, so I'm not certain which way is more efficient and useful (trying to keep these as is and rebase and fix from there, or look into the changes and apply the ones that are still relevant to your other PR (and then just rename your method and add fixes to it so old usage of get_cloud_uris() will work with it, too)

@snbianco
Copy link
Contributor

Changes integrated into #3064

@snbianco snbianco closed this Jul 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants