Update example measurement sets to a more original representation #6

Open
iancze opened this issue Dec 6, 2023 · 4 comments
@iancze

iancze commented Dec 6, 2023

Currently, we store example visibility sets in .npy or .asdf format, uploaded to Zenodo. There is some variation, but generally these contain:

  • data (complex visibility), often with shape (nchan, nvis)
  • spatial frequencies $u$ and $v$ in kilolambda (k$\lambda$), with shape (nchan, nvis)
  • flag (Boolean), with shape (nchan, nvis)
  • weight, with shape (nchan, nvis)

But it would be more efficient to store these the way CASA does:

  • $u$, $v$ in meters, shape (nvis)
  • channel frequencies, shape (nchan)
  • data (nchan, nvis) (assuming we average over the npol dimension)
  • flag (nchan, nvis) (assuming we average over the npol dimension)
  • weight (nvis) (assuming we average over the npol dimension, and the weights are not channelized)
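
The "average over the npol dimension" step in the list above could look roughly like the following. This is a minimal sketch, not the mpoldatasets implementation: the function name `average_polarizations` is hypothetical, the arrays are assumed to carry polarization as the leading axis, the weights are assumed to be positive and not channelized, and flags are assumed to follow the CASA convention (True means flagged).

```python
import numpy as np


def average_polarizations(data, weight, flag):
    """Weighted average over the leading npol axis (hypothetical helper).

    data:   complex, shape (npol, nchan, nvis)
    weight: float,   shape (npol, nvis), assumed positive and not channelized
    flag:   bool,    shape (npol, nchan, nvis), True = flagged (assumed)
    """
    w = weight[:, np.newaxis, :]  # (npol, 1, nvis), broadcasts over channels
    weight_avg = weight.sum(axis=0)  # (nvis,)
    data_avg = (w * data).sum(axis=0) / weight_avg  # (nchan, nvis)
    flag_avg = flag.any(axis=0)  # flagged if any polarization is flagged
    return data_avg, weight_avg, flag_avg


# Synthetic stand-in for a real measurement set
rng = np.random.default_rng(0)
npol, nchan, nvis = 2, 4, 10
data = rng.normal(size=(npol, nchan, nvis)) + 1j * rng.normal(size=(npol, nchan, nvis))
weight = rng.uniform(0.5, 2.0, size=(npol, nvis))
flag = np.zeros((npol, nchan, nvis), dtype=bool)

data_avg, weight_avg, flag_avg = average_polarizations(data, weight, flag)
print(data_avg.shape, weight_avg.shape, flag_avg.shape)
```

Summing the weights follows from treating them as inverse variances; if the weights in a given dataset mean something else, the combination rule would need to change.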

I think there are a few benefits to storing the data this way.

  1. The file size is much smaller, so this should speed up downloads for builds
  2. Tutorials can show the user how to convert from the format their data is likely to be in into the format used by MPoL ($u$ and $v$ in $\lambda$, following MPoL #223)
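
As a rough illustration of the conversion a tutorial might show, the sketch below expands $u$, $v$ in meters with shape (nvis) and channel frequencies with shape (nchan) into $u$, $v$ in $\lambda$ with shape (nchan, nvis). The function name `meters_to_lambda` is hypothetical and is not an existing MPoL routine; any sign or conjugation convention differences between CASA and the target package are not handled here.

```python
import numpy as np

C_M_PER_S = 299_792_458.0  # speed of light in m/s


def meters_to_lambda(uu_m, vv_m, freq_hz):
    """Convert baselines in meters to units of wavelength (hypothetical helper).

    uu_m, vv_m: float, shape (nvis,), baseline components in meters
    freq_hz:    float, shape (nchan,), channel frequencies in Hz
    Returns uu, vv in lambda, each with shape (nchan, nvis).
    """
    wavelengths = C_M_PER_S / freq_hz[:, np.newaxis]  # (nchan, 1), in meters
    uu = uu_m[np.newaxis, :] / wavelengths  # (nchan, nvis)
    vv = vv_m[np.newaxis, :] / wavelengths  # (nchan, nvis)
    return uu, vv


# Synthetic compact-format inputs
nchan, nvis = 8, 1000
rng = np.random.default_rng(1)
uu_m = rng.uniform(-500.0, 500.0, size=nvis)
vv_m = rng.uniform(-500.0, 500.0, size=nvis)
freq_hz = np.linspace(230.0e9, 230.1e9, nchan)
weight = rng.uniform(0.5, 2.0, size=nvis)

uu, vv = meters_to_lambda(uu_m, vv_m, freq_hz)

# Non-channelized weights can be expanded to (nchan, nvis) when needed
weight_2d = np.broadcast_to(weight, (nchan, nvis))

# Coordinate storage in the compact layout versus the expanded layout
compact_bytes = uu_m.nbytes + vv_m.nbytes + freq_hz.nbytes
expanded_bytes = uu.nbytes + vv.nbytes
print(compact_bytes, expanded_bytes)
```

Dividing the result by 1e3 gives k$\lambda$. Note that the data and flag arrays keep shape (nchan, nvis) in both layouts, so the savings come from the coordinates and weights.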

@jeffjennings we might want to discuss this as part of a larger redesign for MPoL #223 and tutorials (#63).

@jeffjennings

Sounds good, can cover it Monday. The commit history of the large versions of the files will also have to be removed, otherwise the download size won't decrease. https://github.com/newren/git-filter-repo might be useful for this.

@iancze

iancze commented Dec 7, 2023

I'm not sure if the commit history matters for the Zenodo repo, since we're downloading these files directly from the Zenodo repo links?

Separately, it is a good idea to scan the MPoL repo to see what large binary files may be lurking in commits, and whether we can safely remove them.
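
One way such a scan could be done is to list every object reachable from any ref and sort the blobs by size. The sketch below wraps the standard `git rev-list --objects --all` and `git cat-file --batch-check` commands; the helper names `parse_blob_sizes` and `largest_blobs` are hypothetical, and paths with unusual characters may need extra handling.

```python
import subprocess

# Output format for `git cat-file --batch-check`: type, hash, size in bytes, path
BATCH_FORMAT = "%(objecttype) %(objectname) %(objectsize) %(rest)"


def parse_blob_sizes(batch_check_output):
    """Return (size, sha, path) tuples for blobs, largest first."""
    blobs = []
    for line in batch_check_output.splitlines():
        parts = line.split(" ", 3)
        # Skip commits, trees, tags, and malformed lines
        if len(parts) < 4 or parts[0] != "blob":
            continue
        _, sha, size, path = parts
        blobs.append((int(size), sha, path))
    return sorted(blobs, reverse=True)


def largest_blobs(repo_path=".", top=10):
    """List the largest blobs anywhere in the history of the repo at repo_path."""
    objects = subprocess.run(
        ["git", "-C", repo_path, "rev-list", "--objects", "--all"],
        capture_output=True, text=True, check=True,
    ).stdout
    sizes = subprocess.run(
        ["git", "-C", repo_path, "cat-file", f"--batch-check={BATCH_FORMAT}"],
        input=objects, capture_output=True, text=True, check=True,
    ).stdout
    return parse_blob_sizes(sizes)[:top]
```

Calling `largest_blobs()` from inside a local clone would return the ten largest blobs in its history, including files that were later deleted. Anything that turns out to be safe to drop could then be rewritten out of the history with the git-filter-repo tool linked above.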

@jeffjennings

Ah right, mpoldatasets isn't a dependency of MPoL.

But yes that's a good idea.

@iancze

iancze commented Dec 7, 2023

The mpoldatasets git repo should be pretty lightweight, since it's just source code and Makefiles. The code downloads the relevant datasets, runs clean / reweighting / averaging steps, etc. to produce large datasets in .npy or .asdf format, and then uses the Zenodo API to upload them to the Zenodo repository. I think we'll need to update several elements of this package to point to the new repo (possibly with an updated API token) and to implement the other changes suggested in this issue.
