Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support Npix partitions with a different file suffix or that are directories #458

Merged
merged 7 commits into from
Mar 5, 2025

Conversation

troyraen
Copy link
Contributor

@troyraen troyraen commented Feb 21, 2025

This PR proposes to implement support for:

  • parquet files having a different suffix than '.parquet'; and/or
  • the Npix level being a directory containing 1+ files

by adding an npix_suffix attribute to TableProperties and using it when constructing the pixel file paths. This attribute would show up in the 'properties' ancillary file as 'hats_npix_suffix'.

These changes would be sufficient to address astronomy-commons/lsdb#435 for this repo. Minor changes will be needed in the lsdb repo in order to use this and I will open that PR soon (update: opened astronomy-commons/lsdb#586). I have made all of the changes in my local code clone and they are working for me -- lsdb.read_hats(…) works with the ZTF DR23 HATS catalogs, which I created with Npix as a directory.

I don't know how these changes might affect use with other filesystems (so far I have only tested against data stored on-premise at IRSA) or if they will cause trouble in other hats or lsdb code that I haven't seen yet. So comments and feedback from folks who know more about those areas will be especially helpful.

Change Description

  • My PR includes a link to the issue that I am addressing

Code Quality

  • I have read the Contribution Guide
  • My code follows the code style of this project
  • My code builds (or compiles) cleanly without any errors or warnings
  • My code contains relevant comments and necessary documentation

Project-Specific Pull Request Checklists

New Feature Checklist

  • I have added or updated the docstrings associated with my feature using the NumPy docstring format
  • I have updated the tutorial to highlight my new feature (if appropriate)
  • I have added unit/End-to-End (E2E) test cases to cover my new feature
  • My change includes a breaking change
    • My change includes backwards compatibility and deprecation warnings (if possible)

Documentation Change Checklist

@troyraen troyraen self-assigned this Feb 21, 2025
@troyraen troyraen added the enhancement New feature or request label Feb 21, 2025
Copy link

codecov bot commented Feb 21, 2025

Codecov Report

Attention: Patch coverage is 87.50000% with 2 lines in your changes missing coverage. Please review.

Project coverage is 93.15%. Comparing base (91d59ee) to head (304f679).
Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
src/hats/io/file_io/file_io.py 84.61% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #458      +/-   ##
==========================================
- Coverage   93.21%   93.15%   -0.07%     
==========================================
  Files          47       47              
  Lines        2033     2044      +11     
==========================================
+ Hits         1895     1904       +9     
- Misses        138      140       +2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link

github-actions bot commented Feb 21, 2025

Before [91d59ee] After [9704203] Ratio Benchmark (Parameter)
12.0±0.3ms 12.3±0.4ms 1.03 benchmarks.Suite.time_inner_pixel_alignment
1.04±0.03ms 1.06±0.02ms 1.01 benchmarks.time_test_cone_filter_multiple_order
71.8±0.4ms 71.7±0.4ms 1 benchmarks.MetadataSuite.time_load_partition_info_order7
352±3ms 353±2ms 1 benchmarks.Suite.time_outer_pixel_alignment
38.8±0.9ms 38.8±0.5ms 1 benchmarks.Suite.time_pixel_tree_creation
72.3±0.8ms 71.8±0.3ms 0.99 benchmarks.MetadataSuite.time_load_partition_join_info
18.8±0.05ms 18.4±0.06ms 0.98 benchmarks.MetadataSuite.time_load_partition_info_order6
110±5ms 105±0.5ms 0.95 benchmarks.time_test_alignment_even_sky

Click here to view all benchmarks.

@troyraen
Copy link
Contributor Author

@delucchi-cmu will you tag in any other folks who's review will be helpful here?

I have not added any new unit tests, so only the standard, single-file Npix partitions with a '.parquet' suffix are currently covered. I can add a new test dataset(s) and unit test(s) if needed, but I'd like to at least get feedback on the implementation before setting all of that up.

@troyraen troyraen changed the title Support partitions with a different file suffix or that are directories Support Npix partitions with a different file suffix or that are directories Feb 21, 2025
@troyraen
Copy link
Contributor Author

I have not added any new unit tests

I added unit tests and data in astronomy-commons/lsdb#586 that cover both cases described in the bullet points in the main comment above. I'm not sure how you prefer to handle this... do you want similar tests and data in both repos or do you prefer them in one repo over another? (If it matters, neither PR introduces new code that is not covered by existing tests so the coverage shouldn't change either way, though obviously you'll want new tests somewhere to cover the new functionality.)

Copy link
Contributor

@smcguire-cmu smcguire-cmu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Troy! This looks good, apart from the issue with the filesystems. Let us know if you'd like to set up some coworking time or a call to try and figure that one out together.

As for unit tests, we generally like to have each package tested independently, so generating the test catalogs in hats too and testing the extra functionality here would be great.

@troyraen
Copy link
Contributor Author

As for unit tests, we generally like to have each package tested independently, so generating the test catalogs in hats too and testing the extra functionality here would be great.

Ok thanks, I'll add them here.

I don't think that pandas supports loading folders over HTTP

Ah, right. I tried that a year or two ago and found the same thing -- pd.read_parquet won't accept a directory. I'll have it pass in a list of files instead. That is working for me locally (thanks for the tip about including the filesystem).

@troyraen troyraen force-pushed the raen/issues/lsdb/435/support-directory-partitions branch from 3b3cdca to 5f20337 Compare March 1, 2025 06:37
Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@troyraen
Copy link
Contributor Author

troyraen commented Mar 1, 2025

Added unit tests and data. Rebased onto main and force pushed.

I don't know how to test against data over http (see below). Otherwise, everything is complete that I'm aware of.

@troyraen troyraen requested a review from smcguire-cmu March 1, 2025 06:58
Copy link
Contributor

@delucchi-cmu delucchi-cmu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

@troyraen troyraen force-pushed the raen/issues/lsdb/435/support-directory-partitions branch from 5f20337 to 304f679 Compare March 5, 2025 20:34
@troyraen
Copy link
Contributor Author

troyraen commented Mar 5, 2025

Confirmed with @delucchi-cmu on slack to go ahead and merge. I can do a follow-up pr if more is needed.

@troyraen troyraen merged commit f08cb51 into main Mar 5, 2025
10 of 12 checks passed
@troyraen troyraen deleted the raen/issues/lsdb/435/support-directory-partitions branch March 5, 2025 20:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants