Disco preprocessing is too constraining #649

Open · 2 tasks
JulienVig opened this issue Mar 12, 2024 · 3 comments
Labels: discojs (Related to Disco.js)

JulienVig commented Mar 12, 2024

A recent refactoring enforced tasks' preprocessing to be lazy and streaming. The preprocessing first defines stateless preprocessing functions (e.g. resize and normalize for images) and then applies them successively to the dataset, one row at a time:

// The preprocessing functions are applied as a map on the dataset:
this.dataset.map(this.preprocessing)

Limitations to address:

  • This functional preprocessing doesn't allow "stateful" preprocessing. Because of its streaming nature, it is currently not possible to normalize a tabular column, since we can't compute aggregations over the whole dataset (e.g. the mean and standard deviation of a feature). In other words, any new preprocessing function can only take a single dataset row as its sole argument, which is very constraining.
  • The preprocessing state learned during training should be saved so it can be re-used for testing and inference. For example, the test set should be standardized with the training set's mean and standard deviation, not the test set's own statistics. The preprocessing state should therefore be saved, which is currently not supported.
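The two limitations above can be addressed together with a "fit then transform" pattern: aggregate the needed statistics over the dataset once, then return a stateless row mapper plus a serializable state. A minimal sketch in TypeScript; none of these names (`fitStandardizer`, `FittedPreprocessor`) exist in Disco.js, they only illustrate the idea:

```typescript
// Hypothetical sketch: a "fitted" preprocessor that first aggregates
// statistics over the whole dataset, then returns a stateless per-row
// mapper. The `state` object can be serialized after training and
// reloaded for test/inference.
type Row = Record<string, number>;

interface FittedPreprocessor {
  state: { mean: number; std: number };
  apply: (row: Row) => Row;
}

function fitStandardizer(rows: Row[], column: string): FittedPreprocessor {
  // First pass: compute the column's mean and standard deviation.
  const values = rows.map((r) => r[column]);
  const mean = values.reduce((a, b) => a + b, 0) / values.length;
  const variance =
    values.reduce((a, v) => a + (v - mean) ** 2, 0) / values.length;
  const std = Math.sqrt(variance);

  // The returned mapper is stateless again, so it can still be applied
  // lazily, one row at a time, like the current preprocessing functions.
  return {
    state: { mean, std },
    apply: (row) => ({ ...row, [column]: (row[column] - mean) / std }),
  };
}

const train: Row[] = [{ age: 20 }, { age: 30 }, { age: 40 }];
const standardizer = fitStandardizer(train, "age");
const processed = train.map(standardizer.apply);
```

At test time, the same `standardizer.apply` (or a mapper rebuilt from the saved `state`) would be used instead of refitting on the test data.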
@JulienVig JulienVig added the discojs Related to Disco.js label Mar 12, 2024
JulienVig commented:

#781 may allow more flexibility for preprocessing


tharvik commented Oct 3, 2024

#781 may allow more flexibility for preprocessing

It should, it should. Answering based on a comment you made there and some other comments:

choosing how to handle missing values [in tabular] should be the responsibility of the data owner so we could throw an error or drop the whole row as default behavior.

Totally! For now, here is how I see the usage of discojs:

  • for easy access, use Disco and let it transform data via the preprocess function
    • here, we can be opinionated and actively choose what the behavior should be (default/throw/drop/…)
    • datatype-specific processing is set in Task
  • for more advanced usage, directly use Trainer and process the data yourself
    • it should be quite straightforward to do so, while allowing any error handling

In the Disco case, I'm thinking of throwing for now, letting the user know that the dataset is missing important values rather than choosing ourselves how to handle it (currently, the Titanic dataset is missing some values).
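The default/throw/drop options discussed above could be expressed as a small configurable policy. A sketch, assuming hypothetical names (`handleMissing`, `MissingPolicy`) that are not part of Disco.js:

```typescript
// Hypothetical sketch of configurable missing-value handling for tabular
// data. "throw" is the opinionated default discussed above; "drop" and
// "default" are the alternatives a data owner could opt into.
type Row = Record<string, number | undefined>;
type MissingPolicy = "throw" | "drop" | "default";

function handleMissing(
  rows: Row[],
  column: string,
  policy: MissingPolicy,
  fallback = 0,
): Row[] {
  switch (policy) {
    case "throw":
      // Surface the problem to the user instead of silently fixing it.
      if (rows.some((r) => r[column] === undefined))
        throw new Error(`dataset is missing values in column '${column}'`);
      return rows;
    case "drop":
      // Drop the whole row, as suggested for data owners who prefer it.
      return rows.filter((r) => r[column] !== undefined);
    case "default":
      // Fill the hole with a caller-provided fallback value.
      return rows.map((r) =>
        r[column] === undefined ? { ...r, [column]: fallback } : r,
      );
  }
}
```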

  • This functional preprocessing doesn't allow "stateful" preprocessing. Because of its streaming nature, it is currently not possible to normalize a tabular column, since we can't compute aggregations over the whole dataset (e.g. the mean and standard deviation of a feature). In other words, any new preprocessing function can only take a single dataset row as its sole argument, which is very constraining.

I got scared a while back by non-streaming algorithms, as datasets can be quite huge. But with #781, it's not an issue anymore! One can compute whatever they want on a dataset:

fillEmptyString(dataset, await computeMean(dataset, 'column'))

Note that it requires one full pass over the dataset for computeMean, which I find costly.
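One full pass is indeed unavoidable for exact statistics, but it can at least stay a single streaming pass that computes mean and variance together, without materializing the dataset in memory. A sketch using Welford's online algorithm (the function name is illustrative, not a Disco.js API):

```typescript
// Welford's online algorithm: computes mean and (population) variance of
// a stream of numbers in one pass, with O(1) memory. Works on any
// Iterable, so it fits a lazy/streaming dataset.
function welford(values: Iterable<number>): { mean: number; variance: number } {
  let count = 0;
  let mean = 0;
  let m2 = 0; // running sum of squared deviations from the current mean
  for (const x of values) {
    count += 1;
    const delta = x - mean;
    mean += delta / count;
    m2 += delta * (x - mean);
  }
  return { mean, variance: count > 0 ? m2 / count : 0 };
}
```

This way the aggregation cost is exactly one traversal even when several statistics are needed.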

  • The preprocessing state learned during training should be saved so it can be re-used for testing and inference. For example, the test set should be standardized with the training set's mean and standard deviation, not the test set's own statistics. The preprocessing state should therefore be saved, which is currently not supported.

That kinda ties to the one before: we can do whatever we want with processing now; it's simply functions applied on top of datasets.

Also, tell me if I missed something, but IMO estimating the distribution (mean/stddev) from a dataset is inherently skewed; one would need to know the parameters over the whole population rather than over a specific slice, no?
(maybe a new parameter to Task in this case?)


JulienVig commented Oct 21, 2024

Also, tell me if I missed something, but IMO estimating the distribution (mean/stddev) from a dataset is inherently skewed; one would need to know the parameters over the whole population rather than over a specific slice, no?

Yes, ideally features are normalized according to the overall population statistics, but in practice these are almost never known, so it is common to use empirical estimates.

One important thing though: in collaborative learning, each data owner has a different subset of the data, so normalizing according to the local subset may yield very different results than normalizing using all the data.
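One conceivable direction, sketched here purely for illustration (Disco.js has no such mechanism today, and in practice the shared values might need privacy protection such as secure aggregation): each data owner shares only aggregate statistics (count, sum, sum of squares), which compose exactly into the global mean and standard deviation without sharing raw rows. All names below are hypothetical:

```typescript
// Hypothetical sketch: combining per-client aggregates into global
// normalization statistics in a federated setting. Count, sum, and sum of
// squares compose additively, so the combined result equals the statistics
// of the pooled data.
interface LocalStats {
  count: number;
  sum: number;
  sumSq: number;
}

// Each client computes this locally on its own column values.
function localStats(values: number[]): LocalStats {
  return {
    count: values.length,
    sum: values.reduce((a, b) => a + b, 0),
    sumSq: values.reduce((a, b) => a + b * b, 0),
  };
}

// The server (or any aggregation step) combines the local aggregates.
function combine(stats: LocalStats[]): { mean: number; std: number } {
  const count = stats.reduce((a, s) => a + s.count, 0);
  const sum = stats.reduce((a, s) => a + s.sum, 0);
  const sumSq = stats.reduce((a, s) => a + s.sumSq, 0);
  const mean = sum / count;
  // Population variance via E[X^2] - E[X]^2.
  return { mean, std: Math.sqrt(sumSq / count - mean * mean) };
}
```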

@rabbanitw do you know if there are some common mechanisms for feature preprocessing in federated/decentralized learning? Or is there simply no preprocessing?

FedProx (#802) alleviates the problem of data heterogeneity at the network level, but in the case of tabular data, for example, having features with very different magnitudes can make the local training diverge (e.g. #615).

Edit: it has been an issue in Disco for a long time #32
