Disco preprocessing is too constraining #649
#781 may allow more flexibility for preprocessing.
It should, it should. Answering based on a comment you made there and some other comments.
Totally! For now, how I see the usage of discojs is as follows:
In the Disco case, I'm thinking of throwing now, letting the user know that the dataset is missing important values, rather than choosing ourselves how to handle it (currently, the Titanic dataset is missing some values).
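A minimal sketch of that fail-fast behavior, using a hypothetical `requireColumns` helper over row-based tabular data (not the actual discojs API):

```ts
// Hypothetical helper: throw as soon as a row misses a required value,
// instead of silently choosing how to impute it.
type Row = Record<string, string | number | undefined>;

function* requireColumns(rows: Iterable<Row>, required: string[]): Generator<Row> {
  for (const row of rows) {
    for (const column of required) {
      const value = row[column];
      if (value === undefined || value === '') {
        throw new Error(`dataset is missing a value for column "${column}"`);
      }
    }
    yield row; // row is complete, pass it downstream
  }
}
```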
I got scared a while back with non-streaming algorithms, as datasets can be quite large. But with #781, it's not an issue anymore! One can compute whatever they want on a dataset: `fillEmptyString(dataset, await computeMean(dataset, 'column'))`. Note that it requires one full pass over the dataset to compute the mean.
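A sketch of what such helpers could look like, assuming datasets are async iterables of rows; the names and signatures below are illustrative, not the actual discojs API:

```ts
// Illustrative helpers, not the actual discojs API.
type Row = Record<string, string | number | undefined>;
type Dataset = AsyncIterable<Row>;

// One full pass over the dataset to get the empirical mean of a column.
async function computeMean(dataset: Dataset, column: string): Promise<number> {
  let sum = 0;
  let count = 0;
  for await (const row of dataset) {
    const raw = row[column];
    if (raw === undefined || raw === '') continue; // skip missing values
    const value = Number(raw);
    if (!Number.isNaN(value)) {
      sum += value;
      count += 1;
    }
  }
  if (count === 0) throw new Error(`no numeric values in column "${column}"`);
  return sum / count;
}

// Lazily replace missing values in a column with a precomputed fill value.
async function* fillEmpty(dataset: Dataset, column: string, fill: number): AsyncGenerator<Row> {
  for await (const row of dataset) {
    yield row[column] === undefined || row[column] === ''
      ? { ...row, [column]: fill }
      : row;
  }
}

// Usage: const cleaned = fillEmpty(data, 'age', await computeMean(data, 'age'));
```

Since `computeMean` consumes the dataset once before `fillEmpty` streams over it again, the dataset has to be re-iterable (e.g. re-creatable from its source) rather than a one-shot stream.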
That kinda ties to the one before: we can do whatever now with preprocessing, it's simply functions applied on top of datasets. Also, tell me if I missed something, but IMO estimating the distribution (mean/stddev) from a dataset is inherently skewed: one would need to know the parameters over the whole population rather than over a specific slice, no?
Yes, ideally features are normalized according to the overall population statistics, but in practice those are almost never known, so it is common to use the empirical estimates. One important caveat though: in collaborative learning, each data owner holds a different subset of the data, so normalizing according to the local subset may yield very different results than normalizing using all the data. @rabbanitw do you know if there are common mechanisms for feature preprocessing in federated/decentralized learning? Or is there simply no preprocessing?

FedProx (#802) alleviates the problem of data heterogeneity at the network level, but in the case of tabular data, for example, features with very different magnitudes can make the local training diverge (e.g. #615).

Edit: it has been an issue in Disco for a long time, see #32.
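To make the local-vs-global statistics concern concrete, here is a toy sketch (not Disco code) of z-score normalization. Each client normalizing with the mean/stddev of its own shard maps both shards to roughly zero mean and unit variance, erasing the fact that their raw values live on very different scales:

```ts
// Toy illustration (not discojs code): z-score normalization of one feature.
function zScoreNormalize(values: number[]): number[] {
  const mean = values.reduce((a, b) => a + b, 0) / values.length;
  const variance = values.reduce((a, v) => a + (v - mean) ** 2, 0) / values.length;
  const std = Math.sqrt(variance) || 1; // guard against zero variance
  return values.map((v) => (v - mean) / std);
}

// Two clients holding skewed shards of the same feature:
const clientA = [1, 2, 3, 4];
const clientB = [100, 200, 300, 400];
console.log(zScoreNormalize(clientA)); // ≈ [-1.34, -0.45, 0.45, 1.34]
console.log(zScoreNormalize(clientB)); // same z-scores, 100× larger raw scale
```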
A recent refactoring enforced tasks' preprocessing in a lazy and streaming fashion. The preprocessing first defines stateless preprocessing functions (e.g. `resize` and `normalize` for images) and then applies them successively to the dataset, one row at a time, as sketched below.
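A minimal sketch of this streaming pattern, with illustrative names rather than the actual discojs API:

```ts
// Stateless transforms are composed and applied lazily, one row at a
// time, while the dataset streams (illustrative, not the discojs API).
type Transform<T> = (row: T) => T;

async function* preprocess<T>(
  dataset: AsyncIterable<T>,
  transforms: Transform<T>[],
): AsyncGenerator<T> {
  for await (const row of dataset) {
    // Each row flows through every transform before the next row is read,
    // so the whole dataset never needs to fit in memory.
    yield transforms.reduce((r, t) => t(r), row);
  }
}

// e.g. for images: preprocess(images, [resize(224, 224), normalize])
```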
Limitations to address: