Redesign UVDataset with Pytorch idioms in mind #162

Closed
iancze opened this issue Feb 22, 2023 · 1 comment · Fixed by #248

iancze commented Feb 22, 2023

In the first part of the general effort to redesign the visibility datasets (#126), we should redesign/update the UVDataset class. Currently, this class is used nowhere in the codebase, so it shouldn't be too difficult to experiment with new ideas (we could also delete this object, if we decide that's the right course).

The idea is that UVDataset (or some renamed version of it) will be for interacting with what we are calling the "loose" visibilities, i.e., the ungridded visibilities read directly from a measurement set. For example, a typical ALMA measurement set might contain 300,000+ individual visibility measurements. Because dealing with so many visibility points is computationally expensive, most users will want to interact with a GriddedDataset (whose redesign is discussed in #163). A GriddedDataset requires some special indexing to match up with the FourierCube output, so redesigning the UVDataset is probably the more straightforward of the two issues, even if it's not the first object most people will use. And once we figure out some of the larger redesign issues, we should have a better idea of how to redesign GriddedDataset.

There are several instances where the user would want to interact with the loose, ungridded visibilities, and a UVDataset would be helpful in those cases. This is now possible thanks to the NuFFT (#78) in the codebase.

The goal of this issue, generally, is to align our dataset objects with as many of the Pytorch idioms as possible, described here, here, and here.

The idea is that the user will instantiate the dataset with numpy arrays for u, v, weight, and data (much the same way DataAverager is instantiated).
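
A minimal sketch of what that construction could look like, written as a map-style torch.utils.data.Dataset. The constructor signature, dtypes, and validation checks here are illustrative assumptions, not a settled design:

```python
import numpy as np
import torch
from torch.utils.data import Dataset


class UVDataset(Dataset):
    """Loose (ungridded) visibilities as a map-style PyTorch dataset."""

    def __init__(self, uu, vv, weight, data):
        # basic error checking on the raw numpy arrays
        assert uu.shape == vv.shape == weight.shape == data.shape, (
            "u, v, weight, and data must all have the same shape"
        )
        assert np.all(weight > 0), "all weights must be positive"

        # store as tensors; complex visibilities stay complex
        self.uu = torch.as_tensor(uu, dtype=torch.float64)
        self.vv = torch.as_tensor(vv, dtype=torch.float64)
        self.weight = torch.as_tensor(weight, dtype=torch.float64)
        self.data = torch.as_tensor(data, dtype=torch.complex128)

    def __len__(self):
        return len(self.uu)

    def __getitem__(self, idx):
        # returning a tuple lets the default collate_fn batch each field
        return self.uu[idx], self.vv[idx], self.weight[idx], self.data[idx]
```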

Things we should think about:

  • Is there any pre-processing that should be done on the arrays? Error checking?
  • How does the device location interact with creation/moving/slicing?
  • Should this be an IterableDataset or any of the other types of datasets provided by Pytorch?
  • How should the dataset use batch dimensions to parallelize when we have multiple channels in the data, as in a spectral cube?
  • How should the dataset use sub-batches, in the limit where we have many, many visibilities and we'd like to use something like stochastic gradient descent? (See the DataLoader sketch after this list.)
  • other considerations?
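
On the batching and sub-batching questions above, the standard PyTorch pattern is to wrap a map-style dataset in a DataLoader, which handles shuffling, sub-batch collation, and epoch boundaries. A hedged sketch, reusing the UVDataset sketch above, with synthetic arrays standing in for a real measurement set:

```python
import numpy as np
from torch.utils.data import DataLoader

# synthetic stand-ins for arrays read from a measurement set
rng = np.random.default_rng(seed=42)
N = 300_000
dataset = UVDataset(
    uu=rng.normal(size=N),
    vv=rng.normal(size=N),
    weight=rng.uniform(0.1, 1.0, size=N),
    data=rng.normal(size=N) + 1j * rng.normal(size=N),
)

# shuffle=True re-partitions the visibilities every epoch, which is
# what a stochastic gradient descent loop over sub-batches wants
loader = DataLoader(dataset, batch_size=10_000, shuffle=True)

for uu, vv, weight, data in loader:
    # each iterate is a sub-batch of up to 10,000 visibilities:
    # evaluate the NuFFT and the loss on this chunk only, then step
    pass
```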

We may also want a UVDataset to contain a routine that converts it to a GriddedDataset (by passing through the gridding.DataAverager).
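
A non-authoritative sketch of what that conversion routine could look like. The DataAverager argument names and the to_pytorch_dataset call reflect one reading of the current gridding module and may not match every MPoL version; the GridCoords values are purely illustrative:

```python
from mpol import coordinates, gridding


def to_gridded(uvdataset, coords):
    """Grid loose visibilities into a GriddedDataset by passing them
    through gridding.DataAverager. Argument names are assumptions and
    may differ between MPoL versions."""
    averager = gridding.DataAverager(
        coords=coords,
        uu=uvdataset.uu.numpy(),
        vv=uvdataset.vv.numpy(),
        weight=uvdataset.weight.numpy(),
        data_re=uvdataset.data.real.numpy(),
        data_im=uvdataset.data.imag.numpy(),
    )
    return averager.to_pytorch_dataset()


# usage (cell_size in arcsec; values purely illustrative):
# coords = coordinates.GridCoords(cell_size=0.005, npix=800)
# gridded = to_gridded(dataset, coords)
```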

iancze commented Dec 22, 2023

We've made considerable design progress on this through the adoption of the SGD paradigm. While addressing this issue, we should also add

  • custom per-EB Sampler objects (see the sketch below)
  • tutorial/documentation on how to use batching via DataLoader
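
One possible shape for the per-EB samplers, as a hedged sketch: a batch sampler that groups visibility indices by an execution-block label (the eb_ids array is a hypothetical input) and only ever draws a mini-batch from within a single EB, handed to DataLoader via its batch_sampler argument:

```python
import numpy as np
from torch.utils.data import DataLoader, Sampler


class PerEBBatchSampler(Sampler):
    """Hypothetical per-EB batch sampler: every mini-batch it yields
    contains visibilities from a single execution block (EB)."""

    def __init__(self, eb_ids, batch_size):
        self.batch_size = batch_size
        # group visibility indices by their EB label
        self.groups = [np.flatnonzero(eb_ids == eb) for eb in np.unique(eb_ids)]

    def __iter__(self):
        rng = np.random.default_rng()
        batches = []
        for indices in self.groups:
            perm = rng.permutation(indices)
            batches += [
                perm[i : i + self.batch_size]
                for i in range(0, len(perm), self.batch_size)
            ]
        rng.shuffle(batches)  # interleave batches from different EBs
        for batch in batches:
            yield batch.tolist()

    def __len__(self):
        return sum(-(-len(g) // self.batch_size) for g in self.groups)


# usage: eb_ids is a hypothetical per-visibility EB label array
# loader = DataLoader(dataset, batch_sampler=PerEBBatchSampler(eb_ids, 10_000))
```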
