Redesign UVDataset with Pytorch idioms in mind #162

Closed
iancze opened this issue Feb 22, 2023 · 1 comment · Fixed by #248

iancze commented Feb 22, 2023

In the first part of the general effort to redesign the visibility datasets (#126), we should redesign/update the UVDataset class. Currently, this class is used nowhere in the codebase, so it shouldn't be too difficult to experiment with new ideas (we could also delete this object, if we decide that's the right course).

The idea is that UVDataset (or some renamed version of it) will be for interacting with what we are calling the "loose" visibilities, i.e., the ungridded visibilities read directly from a measurement set. For example, a typical ALMA measurement set might contain 300,000+ individual visibility measurements. Because dealing with so many visibility points is computationally expensive, most users will want to interact with a GriddedDataset (whose redesign is discussed in #163). A GriddedDataset requires some special indexing to match up with the FourierCube output, so redesigning the UVDataset is probably the more straightforward of the two issues, even if it's not the first object most people will use. And once we figure out some of the larger redesign issues, we should have a better idea of how to redesign GriddedDataset.

There are several instances where the user would want to interact with the loose, ungridded visibilities, and a UVDataset would be helpful in those cases. This is now possible thanks to the NuFFT (#78) in the codebase.

The goal of this issue, generally, is to align our dataset objects with as many of the Pytorch idioms as possible, described here, here, and here.

The idea is that the user will instantiate the dataset with numpy arrays for u, v, weight, and data (much the same way DataAverager is instantiated).
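
A minimal sketch of what that construction could look like, written as a map-style torch.utils.data.Dataset. The constructor signature, dtypes, and validation checks here are illustrative assumptions, not a settled design:

```python
import numpy as np
import torch
from torch.utils.data import Dataset


class UVDataset(Dataset):
    """Loose (ungridded) visibilities as a map-style PyTorch dataset."""

    def __init__(self, uu, vv, weight, data):
        # basic error checking on the raw numpy arrays
        assert uu.shape == vv.shape == weight.shape == data.shape, (
            "u, v, weight, and data must all have the same shape"
        )
        assert np.all(weight > 0), "all weights must be positive"

        # store as tensors; complex visibilities stay complex
        self.uu = torch.as_tensor(uu, dtype=torch.float64)
        self.vv = torch.as_tensor(vv, dtype=torch.float64)
        self.weight = torch.as_tensor(weight, dtype=torch.float64)
        self.data = torch.as_tensor(data, dtype=torch.complex128)

    def __len__(self):
        return len(self.uu)

    def __getitem__(self, idx):
        # returning a tuple lets the default collate_fn batch each field
        return self.uu[idx], self.vv[idx], self.weight[idx], self.data[idx]
```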

Things we should think about:

  • Is there any pre-processing that should be done on the arrays? Error checking?
  • How does the device location interact with creation/moving/slicing?
  • Should this be an IterableDataset or any of the other types of datasets provided by Pytorch?
  • How should the dataset use batch dimensions to parallelize when we have multiple channels in the data, as in a spectral cube?
  • How should the dataset use sub-batches, in the limit where we have many, many visibilities and we'd like to use something like stochastic gradient descent? (See the DataLoader sketch after this list.)
  • other considerations?
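
On the batching and sub-batching questions above, the standard PyTorch pattern is to wrap a map-style dataset in a DataLoader, which handles shuffling, sub-batch collation, and epoch boundaries. A hedged sketch, reusing the UVDataset sketch above, with synthetic arrays standing in for a real measurement set:

```python
import numpy as np
from torch.utils.data import DataLoader

# synthetic stand-ins for arrays read from a measurement set
rng = np.random.default_rng(seed=42)
N = 300_000
dataset = UVDataset(
    uu=rng.normal(size=N),
    vv=rng.normal(size=N),
    weight=rng.uniform(0.1, 1.0, size=N),
    data=rng.normal(size=N) + 1j * rng.normal(size=N),
)

# shuffle=True re-partitions the visibilities every epoch, which is
# what a stochastic gradient descent loop over sub-batches wants
loader = DataLoader(dataset, batch_size=10_000, shuffle=True)

for uu, vv, weight, data in loader:
    # each iterate is a sub-batch of up to 10,000 visibilities:
    # evaluate the NuFFT and the loss on this chunk only, then step
    pass
```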

We may also want a UVDataset to contain a routine that converts it to a GriddedDataset (by passing through the gridding.DataAverager).
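
A non-authoritative sketch of what that conversion routine could look like. The DataAverager argument names and the to_pytorch_dataset call reflect one reading of the current gridding module and may not match every MPoL version; the GridCoords values are purely illustrative:

```python
from mpol import coordinates, gridding


def to_gridded(uvdataset, coords):
    """Grid loose visibilities into a GriddedDataset by passing them
    through gridding.DataAverager. Argument names are assumptions and
    may differ between MPoL versions."""
    averager = gridding.DataAverager(
        coords=coords,
        uu=uvdataset.uu.numpy(),
        vv=uvdataset.vv.numpy(),
        weight=uvdataset.weight.numpy(),
        data_re=uvdataset.data.real.numpy(),
        data_im=uvdataset.data.imag.numpy(),
    )
    return averager.to_pytorch_dataset()


# usage (cell_size in arcsec; values purely illustrative):
# coords = coordinates.GridCoords(cell_size=0.005, npix=800)
# gridded = to_gridded(dataset, coords)
```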

iancze commented Dec 22, 2023

We've made considerable design progress on this through the adoption of the SGD paradigm. While addressing this issue, we should also add

  • custom per-EB Sampler objects (see the sketch below)
  • tutorial/documentation on how to use batching via DataLoader
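
One possible shape for the per-EB samplers, as a hedged sketch: a batch sampler that groups visibility indices by an execution-block label (the eb_ids array is a hypothetical input) and only ever draws a mini-batch from within a single EB, handed to DataLoader via its batch_sampler argument:

```python
import numpy as np
from torch.utils.data import DataLoader, Sampler


class PerEBBatchSampler(Sampler):
    """Hypothetical per-EB batch sampler: every mini-batch it yields
    contains visibilities from a single execution block (EB)."""

    def __init__(self, eb_ids, batch_size):
        self.batch_size = batch_size
        # group visibility indices by their EB label
        self.groups = [np.flatnonzero(eb_ids == eb) for eb in np.unique(eb_ids)]

    def __iter__(self):
        rng = np.random.default_rng()
        batches = []
        for indices in self.groups:
            perm = rng.permutation(indices)
            batches += [
                perm[i : i + self.batch_size]
                for i in range(0, len(perm), self.batch_size)
            ]
        rng.shuffle(batches)  # interleave batches from different EBs
        for batch in batches:
            yield batch.tolist()

    def __len__(self):
        return sum(-(-len(g) // self.batch_size) for g in self.groups)


# usage: eb_ids is a hypothetical per-visibility EB label array
# loader = DataLoader(dataset, batch_sampler=PerEBBatchSampler(eb_ids, 10_000))
```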
