Cross validation workflow #99

Closed
iancze opened this issue Sep 14, 2022 · 7 comments

iancze (Collaborator) commented Sep 14, 2022

Cross validation is useful for determining optimal (or at least good enough) parameter settings for regularization.

Currently, though, most of the functionality for doing this exists outside of the MPoL package itself. This is partially by design and mirrors the way some PyTorch projects are set up with respect to functionality / optimizers. However, the current K-fold CV workflow is somewhat clunky and there are likely areas for improvement.

Describe the solution you'd like

  • Catalogue issues in the CV workflow using this issue @briannazawadzki
  • Explore potential designs for solutions. I think it makes sense to try to keep the core MPoL package focused on the evaluation of an image relative to interferometric data and have CV routines live in a separate MPoL-dev affiliated package (e.g., the way visread or mpoldatasets do). But there probably are a few changes to MPoL itself that would be useful.
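
For concreteness, here is a minimal sketch (hypothetical names, not the MPoL API) of the kind of K-fold loop the workflow is built around: score candidate regularization strengths by averaging a held-out loss over the folds.

```python
# A minimal sketch (hypothetical names, not the MPoL API) of a K-fold loop that
# scores candidate regularization strengths on held-out visibilities.
import numpy as np

def cross_validate(lambdas, folds, train_model, test_score):
    """
    lambdas     : iterable of candidate regularization strengths
    folds       : list of (train_set, test_set) pairs
    train_model : callable(train_set, lam) -> fitted model   (assumed to exist)
    test_score  : callable(model, test_set) -> scalar loss   (assumed to exist)
    """
    cv_scores = {}
    for lam in lambdas:
        scores = [test_score(train_model(train, lam), test) for train, test in folds]
        cv_scores[lam] = np.mean(scores)
    # the regularization strength with the lowest mean held-out loss generalizes best
    return min(cv_scores, key=cv_scores.get)
```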
iancze (Collaborator, Author) commented Sep 17, 2022

Possibly useful for visualization (in addition to TensorBoard): https://napari.org/stable/index.html

iancze (Collaborator, Author) commented Nov 22, 2022

On a related but possibly separate note, @jeffjennings also mentioned that it might be interesting to ensure that cross-validation folds always have roughly the same 1D weighted baseline distribution.

jeffjennings (Contributor) commented Jan 31, 2023

I think one aspect of the current cross-val workflow that could be improved is the train/test set division in KFoldCrossValidatorGridded, moving from standard k-fold to stratified k-fold. This would address the following:

  • Currently the data are divided into a list of cells using a Dartboard, and this cell list is then split into train/test sets. Because the number of visibilities per cell can vary a lot, the training sets often don't have a similar number of points (and likewise for the test sets). Using a single dataset (of real observations) as a trial case, the size of the test set varies by up to 35% for k=5.
    • In turn, the training:test set size ratio can vary a lot, from 19% to 34% in the trial case.
  • I don't think it's best to withhold grouped chunks of (u,v) space (whole dartboard cells): the model should be able to accurately predict data it hasn't seen, but the test data should still be similar to the training data. There might be problematic edge cases too, like a highly asymmetric source.
    • Using dartboard cells also makes it harder to ensure that the training sets cover a similar baseline distribution.

A stratified k-fold approach would ensure the training sets have almost exactly the same number of points, including the same number in each of several baseline bins. This also ensures the train:test set size ratio is constant and almost exactly equal to a chosen value.
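
As a rough illustration of that stratification (a hypothetical function, not MPoL code; scikit-learn is used only for brevity and is not an MPoL dependency), each visibility can be labeled by its baseline bin so that every fold samples all baseline ranges evenly:

```python
# A minimal sketch of stratified k-fold over baseline bins (hypothetical function).
import numpy as np
from sklearn.model_selection import StratifiedKFold  # illustration only

def stratified_baseline_folds(uu, vv, k=5, n_bins=8, seed=42):
    """Yield (train_idx, test_idx) index pairs stratified over 1D baseline bins.

    uu, vv : 1D arrays of visibility coordinates (hypothetical inputs).
    """
    qq = np.hypot(uu, vv)                              # baseline length per visibility
    edges = np.quantile(qq, np.linspace(0, 1, n_bins + 1))
    labels = np.digitize(qq, edges[1:-1])              # baseline-bin label per visibility
    skf = StratifiedKFold(n_splits=k, shuffle=True, random_state=seed)
    # StratifiedKFold preserves the per-bin proportions in every split, so each fold
    # has nearly the same size and the same 1D baseline distribution.
    yield from skf.split(np.zeros((len(qq), 1)), labels)
```

Each test set then contains roughly 1/k of the points in every baseline bin, so the train:test size ratio is fixed by construction.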

briannazawadzki (Contributor) commented

We should implement an easy way to use uniform partitioning for CV, similar to how we implemented Dartboard.
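
A rough sketch of what uniform (random-cell) partitioning could look like (hypothetical names, not the eventual MPoL implementation): assign each occupied UV cell to one of k folds uniformly at random, with no radial/azimuthal grouping.

```python
# A rough sketch of uniform (random-cell) partitioning, analogous in spirit to the
# Dartboard interface but without grouping cells into annuli/wedges (hypothetical).
import numpy as np

def random_cell_folds(cell_indices, k=5, seed=0):
    """cell_indices : 1D array of gridded UV-cell IDs that contain data."""
    rng = np.random.default_rng(seed)
    fold_of_cell = rng.integers(0, k, size=len(cell_indices))
    # fold i tests on the cells assigned to i and trains on all the others
    return [
        (cell_indices[fold_of_cell != i], cell_indices[fold_of_cell == i])
        for i in range(k)
    ]
```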

briannazawadzki (Contributor) commented

See below for the forced (not generalized at all) implementation we used for testing in 2021

Messy random cell cross validation

briannazawadzki (Contributor) commented

KFoldCrossValidatorGridded will need to be generalized or changed, as right now it requires Dartboard and does not allow for other options. We could either rename it to communicate that it is Dartboard-specific, or make a generalized KFoldCrossValidatorGridded that can handle multiple types of partitioning.
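
One possible shape for the generalized version (a sketch of a hypothetical interface, not necessarily what was ultimately implemented): inject the partitioning scheme as a strategy object so Dartboard and random-cell partitioners share the same k-fold machinery.

```python
# Hypothetical interface sketch: the partitioning scheme is supplied to the
# cross-validator rather than hard-coded, so Dartboard and random-cell variants
# can be swapped in without changing the k-fold loop.
from typing import Iterator, Protocol, Tuple

class Partitioner(Protocol):
    def split(self, dataset, k: int) -> Iterator[Tuple[object, object]]:
        """Yield (train_dataset, test_dataset) pairs."""
        ...

class KFoldCrossValidator:
    def __init__(self, dataset, partitioner: Partitioner, k: int = 5):
        self.dataset = dataset
        self.partitioner = partitioner
        self.k = k

    def __iter__(self):
        # delegate the train/test division to whichever partitioning scheme was supplied
        return self.partitioner.split(self.dataset, self.k)
```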

jeffjennings added this to the v0.1.4 milestone Feb 3, 2023
iancze (Collaborator, Author) commented Feb 9, 2023

Closing this issue for now, since the main action items (renaming and RandomCell gridding) were implemented in #132. There are still larger discussions to be had about cross-validation strategies (e.g., #93) and, most importantly, accuracy, but once those discussions progress a bit further we can open targeted issues for the codebase.

iancze closed this as completed Feb 9, 2023