Add Euclid MER HATS Parquet notebook #73
Conversation
Minor comments. If you plan to push many commits here while developing, we should consider temporarily turning off execution for the rendering, too.
I'm not sure we should do the numpy uninstall trick in a notebook; it's bad enough that we have installs in there 😅 (That said, I wonder why the install command is not picking up on the numpy upgrade, since the minimum dependency changed due to the lsdb/hats requirements.)
OK, some suggestions for swapping out the pip uninstall line.
Sorry about the conf.py conflict; you may want to rebase now.
Co-authored-by: Brigitta Sipőcz <[email protected]>
Rebased and force-pushed.
@troyraen this notebook was a good exercise for me to learn how to access HATS-format data.
The code looks good and was easy to follow; I mostly have comments about the text. Please note that my comments come from the POV of someone who is new to HATS, LSDB, Dask, etc., so feel free to ignore the ones you think are too obvious for an average reader of this tutorial.
> We peeked at the data but we haven't loaded all of it yet. What we really need in order to create a CMD is the magnitudes, so let's calculate those now. Appending `.compute()` to the commands will trigger Dask to actually load this data into memory. It is not strictly necessary, but will allow us to look at the data repeatedly without having to re-load it each time.
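As a rough illustration of the lazy-evaluation pattern the quoted text describes (a pure-Python sketch of the idea, not actual Dask or LSDB code; the flux value and the 23.9 AB zeropoint are made up for the example):

```python
import math

class Deferred:
    """Minimal stand-in for a lazy, Dask-style task graph."""
    def __init__(self, func, *args):
        self.func = func
        self.args = args

    def compute(self):
        # Only here is any work actually performed, analogous to
        # triggering I/O and calculation with Dask's .compute().
        resolved = [a.compute() if isinstance(a, Deferred) else a
                    for a in self.args]
        return self.func(*resolved)

# Build a tiny "graph": magnitude = -2.5 * log10(flux) + zeropoint.
flux = Deferred(lambda: 1000.0)
mag = Deferred(lambda f: -2.5 * math.log10(f) + 23.9, flux)

# Nothing has run yet; calling .compute() executes the whole chain.
print(mag.compute())  # ≈ 16.4
```

The point mirrors the notebook text: building `mag` is cheap and repeatable, and the expensive work happens only when `.compute()` is called.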
I find it helpful when the text warns about a long-running cell, to avoid assuming that I did something wrong. The following cell took ~12 min for me locally (VPNed in from home). Maybe we can add a rough estimate here?
Yes, I will add an estimate for Fornax.
You may know already, but in case not, I would expect this to take noticeably longer with your setup (Home -> IPAC -> S3 bucket on east coast) than on Fornax due to proximity to data, time to route through VPN, and that most home internet speeds are slower. But it's convenient! Just fyi, tradeoffs.
Co-authored-by: Jaladh Singhal <[email protected]>
Co-authored-by: Brigitta Sipőcz <[email protected]>
Please rebase to sort out the conflicting file.
From the discussion on my HATS notebook, I wonder if we can remove the hats import from here and replace the hats calls in your code with equivalent lsdb calls?
import os

import dask.distributed
import hats
import hats
Thanks, yes, I am in the process of removing hats from this notebook.
try:
    # If running from within IPAC's network (maybe VPN'd in with "tunnel-all"),
    # your IP address acts as your credentials and this should just work.
    hats.read_hats(euclid_s3_path)
Suggested change: replace `hats.read_hats(euclid_s3_path)` with `lsdb.read_hats(euclid_s3_path)`.
```{code-cell}
# Load the dataset.
euclid_hats = hats.read_hats(euclid_s3_path)
Suggested change: replace `euclid_hats = hats.read_hats(euclid_s3_path)` with `euclid_hats = lsdb.read_hats(euclid_s3_path)`.
# Visualize the on-sky distribution of objects in the Q1 MER Catalog.
hats.inspection.plot_density(euclid_hats)
Not sure of the exact lsdb equivalent - `skymap_histogram()`?
```{code-cell}
# Visualize the HEALPix orders of the dataset partitions.
hats.inspection.plot_pixels(euclid_hats)
Suggested change: replace `hats.inspection.plot_pixels(euclid_hats)` with `euclid_hats.plot_pixels()`.
```{code-cell}
# Fetch the pyarrow schema from hats.
euclid_hats = hats.read_hats(euclid_s3_path)
Suggested change: replace `euclid_hats = hats.read_hats(euclid_s3_path)` with `euclid_hats = lsdb.read_hats(euclid_s3_path)`.
```{code-cell}
# Fetch the pyarrow schema from hats.
euclid_hats = hats.read_hats(euclid_s3_path)
schema = euclid_hats.schema
Not sure how to extract the pyarrow schema directly from an lsdb catalog.
Ya, I'm not sure there's a user-friendly way to get the whole schema with lsdb. I'll check once more, but I think pyarrow will be simpler for this.
I have added several more Euclid Q1 tables (>3x more columns) to this dataset, plus a couple of ancillary HATS products (nasa-fornax/fornax-demo-notebooks#416), since this notebook was first drafted. I am redrafting the notebook now, and it will be quite a bit different in order to explain the full product and demonstrate its use. So I will close this PR and open a new one to ease the review process. Thanks for your feedback, everyone!
This PR adds a notebook with an introduction to the HATS version of the Euclid Q1 MER catalogs that IRSA is preparing to release. The dataset is currently in a testing bucket that is available from Fornax and IPAC networks only (see nasa-fornax/fornax-demo-notebooks#394 for details).
Note: Before release, I plan to update both the dataset and the notebook to include data from the Q1 PHZ (photo-z) catalogs along with the MER data that is already there. Many Euclid use cases require a redshift, so this product will give users easier access to that information because they won't have to join the tables themselves. We are interested in adding the spectroscopy catalogs as well, but that may or may not happen in this first round.