-
-
Notifications
You must be signed in to change notification settings - Fork 398
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Speed up mast.Observations.get_cloud_uris()
#2145
Conversation
mast.Observations.get_cloud_uris
by obtaining multiple URIs at oncemast.Observations.get_cloud_uris()
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #2145 +/- ##
==========================================
+ Coverage 66.10% 66.44% +0.34%
==========================================
Files 234 235 +1
Lines 18065 18123 +58
==========================================
+ Hits 11941 12042 +101
+ Misses 6124 6081 -43 ☔ View full report in Codecov by Sentry. |
This now needs and rebase. Also, @jaymedina could you assign it to the person who are supposed to review this, or cc them on the PR? Thanks! |
Hi @bsipocz I'll review this in conjunction with @tomdonaldson (I think he started a review already?) Edit: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks solid, the main suggestion I have is what I mentioned about the GALEX relative path tweak that I discovered was necessary for GALEX on S3. Also added some minor comments, but in general this looks good.
I'd say the first thing to do right now is rebase and rerun unit tests, to verify that GALEX indeed breaks without a tweak to the mast_relative_path
output.
|
||
return [self.get_cloud_uri(product, include_bucket, full_url) for product in data_products] | ||
paths = utils.mast_relative_path(data_products["dataURI"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will break for GALEX data retrieval from the cloud, unfortunately. See here for how I tweaked this output in a recent PR following the new GALEX availability on the cloud:
astroquery/astroquery/mast/cloud.py
Line 119 in 2b3d954
if 'galex' in path: |
Your two options would be to either:
- Make a tweak similar to this in
get_cloud_uri_list
- Make the tweak in
mast_relative_path
- this is probably the cleanest option because then we can remove my tweak fromcloud.py
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
re this suggestion: in the cloud WG we're working on generalizing these methods and move them out from mast. So a tweak at the place of usage, or a generically usable kwarg would be preferable rather than a hard-wired conditional on mast specific strings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I did was move the path.strip()
logic out of cloud.py
and into util.py
's mast_relative_path
function. In other words, I essentially went with option 2.
listed above. And because this is a MAST-specific function in utils
, it makes sense for MAST-specific tweaks to take place in there. Let me know if that works for you!
@barentsen @jaymedina - what are the plans with this PR? |
Ping @barentsen @jaymedina @tomdonaldson, do you have any plans for finishing up this PR? I expect it's reasonably close for being ready. |
ping @jaymedina! |
I'll take a look at this and make updates accordingly. |
… using a single API request
b42de94
to
94d9769
Compare
8438d9e
to
1ce2697
Compare
I've rebased, adapted the code to account for GALEX cloud products, updated error handling and unit tests, moved the MAST-specific string manipulation out of |
6916a1b
to
cfdfcbc
Compare
f181c56
to
0a30aa2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for picking this up and pushing it through the finish line.
I have some minor comments, mostly asking for clarifications.
astroquery/mast/cloud.py
Outdated
path = path.lstrip("/mast/") | ||
# Making sure we got at least 1 URI from the query above. | ||
if len(uri_list) == 0: | ||
raise InvalidQueryError("No products found on the cloud for this query.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this an InvalidQueryError or rather a NoResultsWarning
? We haven't previously raised the error, but if you think the query has an error in it, rather than just there is no result, then we can raise it from now on (probably would require its own changelog entry)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I chose to raise an error rather than a warning so that the output error message is more helpful. Here is what it looks like when I use a NoResultsWarning
:
The final TypeError
is an indirect result of the empty uri list, so it's a vague error message.
Raising a custom InvalidQueryError
makes things more clear:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe then a conditional is needed to return the empty result early, before going into the Typeerror. The point is basically to use the InvalidqueryError when there is an issue with the user input. Not having data, imo is not in this category.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On second thought, a user will realize that their call to Observations.get_cloud_uri(product)
returns an empty list before calling Observations.download_file
, since they have to feed the uri in string form and by default the uri gets returned as a list even when it's 1 URI. So this error handling thing might be redundant.
|
||
warnings.warn("Unable to locate file {}.".format(data_product['productFilename']), NoResultsWarning) | ||
return None | ||
return uri_list[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why only the first element, what is expected to be included in the rest?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The output for get_cloud_uri_list
is always a list, even when the output returns only 1 URI, so the sole element gets indexed out before being returned. I just committed some commentary above this line for clarity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's indeed a good way to handle the one element lists. I wonder what happens when there are more than one, would the rest just be ignored?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There wouldn't be a case where a request for a single cloud URI would return more than one because of L171 in util.py
of this branch, but I can raise an AssertionError here to ensure that there is only 1 element in the list, what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm really going into bikeshedding now, but if that's the case, why does it have to be a list to begin with?
Also, L173 is never reached?
Maybe just a comment about this would be enough, no need for the assertion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, just revisiting this branch in a while, what assertion are you referring to? I added a comment above L118 in this branch explaining why the zero-indexing has to happen. It's because uri_list
is declared as a list in get_cloud_uri_list
which get_cloud_uri
wraps around. Everything is wrapped around get_cloud_uri_list
because that method calls utils.mast_relative_path
which has the new chunking functionality that Geert implemented, which speeds up the call by making 1 request for multiple URIs rather than 1 request per URI.
astroquery/mast/cloud.py
Outdated
# Use `head_object` to verify that the product is available on S3 (not all products are) | ||
s3_client.head_object(Bucket=self.pubdata_bucket, Key=path) | ||
if include_bucket: | ||
path = "s3://{}/{}".format(self.pubdata_bucket, path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't necessarily like the overwriting of variables here as when this has to be debugged it's a bit more difficult to distinguish between the path
from the iteration vs the path here. Maybe use s3_path
?
{"uri": chunk}) | ||
json_response = response.json() | ||
for uri in chunk: | ||
path = json_response.get(uri[1])["path"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see what's expected to be in uri
here, could you explain it a bit, so I see why the second element?
@@ -405,28 +405,32 @@ def test_get_cloud_uri(self): | |||
|
|||
def test_get_cloud_uris(self): | |||
pytest.importorskip("boto3") | |||
test_obs_id = '25568122' | |||
test_obs_ids = ["25568122", "31411"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do this with pytest.mark.parametrize
instead of an internal loop
Hello @barentsen! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:
|
Close/reopen to retrigger CI (as logs were long gone now) |
Needs a rebase for RTD fix and some PEP8 fixes before final review. |
Hi @bsipocz and @keflavich, I'm tagging the interim |
@astrojimig - I suppose the conflict is coming from the changes from your recently merged panstarss branch. Anyway, it would be nice to finally wrap this PR up and merge it. Could you try and rebase this one? I'm not sure you inherited the access to Geert's branch from Jenny. |
@bsipocz Sure, I'll try to work on this next week! |
@bsipocz Working on this now! This might just be my own misunderstanding of GitHub, but do I need some sort of permissions to push to this branch? Whenever I try, I get a permission denied error. Is there an alternate method I should be using? |
You cannot push to someone else's fork by default, so you will need to take the applicable changes from here. I have reviewed this a couple of times, but things also have changed in the module, so I'm not certain which way is more efficient and useful (trying to keep these as is and rebase and fix from there, or look into the changes and apply the ones that are still relevant to your other PR (and then just rename your method and add fixes to it so old usage of get_cloud_uris() will work with it, too) |
Changes integrated into #3064 |
When using
astroquery.mast
, it is currently very slow to download multiple data products in cloud-enabled mode.The main reason for the poor performance is that the
Observations.get_cloud_uris()
helper method is very slow at retrieving AWS S3 URIs. For example, the following example requires >2 minutes to obtain URIs for 186 data products:This is unfortunate because slow data access makes the cloud-enabled mode less attractive.
The main reason
Observations.get_cloud_uris()
is slow is because every data products triggers a separate HTTP request to MAST'spath_lookup
API via theastroquery.mast.utils.mast_relative_path()
helper function.This PR proposes to modify these functions such that MAST's
path_lookup
API is called in chunks of 50 products at once, allowingObservations.get_cloud_uris()
to execute much faster. For example, the demo above is ~7x faster when using this PR branch:(Note: the remaining 20s used by
get_cloud_uris()
is spent polling every cloud URI to verify that it exists before returning it to the user, because not every MAST product is mirrored in the cloud. This will be harder to speed up, but fortunately this may be fast when running in the cloud.)