Felix Grimberg @ MLO EPFL
Read everything about this project:
- Short version in this conference paper
- Extended version in the report (also on overleaf).
In this repo, we investigate methods for collaborative training of personalized ML models based on the distance between gradients in SGD-based optimization algorithms.
To view correctly displayed formulas directly in GitHub, open the IPYNB version of this README!
It's updated nightly to mirror the contents of the Markdown file.
data-preprocessing.ipynb
: This notebook handles the pre-processing of the Ebola dataset. Obtain the dataset CSV file from Mary-Anne Hartley and place it in a folder named ../data/private (relative to the repo folder). Then, execute all cells of data-preprocessing.ipynb in order to deal with missing values. The pre-processing is largely copied from Annie's 2017 paper on Predicting Ebola Infection.
global-model-1.ipynb
: This notebook creates the feature matrix and trains a global model with scikit-learn.
Titanic-preprocessing.ipynb
: This notebook handles the loading and pre-processing of the Titanic dataset. Install tensorflow_datasets and execute all cells of Titanic-preprocessing.ipynb to produce the clean data files used in the federated learning simulations. Many ideas in this notebook are inspired by https://www.kaggle.com/mnassrib/titanic-logistic-regression-with-python.
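For reference, the loading step might look like this minimal sketch; the TFDS dataset name `titanic` is an assumption of this sketch, and the notebook's actual pre-processing goes much further:

```python
import tensorflow_datasets as tfds

# Load the Titanic data (assumes the standard TFDS `titanic` dataset name)
# and convert it to a pandas DataFrame for further pre-processing.
ds = tfds.load("titanic", split="train")
df = tfds.as_dataframe(ds)
```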
Jax_federated_learning.ipynb
: This notebook is the heart of the project. It implements the Weight Erosion aggregator in an earlier version of Sai Praneeth Karimireddy's Jax-based FL simulation framework (GitHub: @saipraneet). Jump to the Markdown cell titled Run a federated learning simulation to specify all simulation parameters, then execute all cells in order (from the top) to run the simulation and output its results.
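As a rough illustration of the idea (not the notebook's actual code: the variable names and the linear decay rule below are assumptions of this sketch; see the paper for the exact scheme), one Weight Erosion style aggregation round could look like this: each participant keeps a weight that is eroded in proportion to the distance between its gradient and the local client's gradient.

```python
import numpy as np

def weight_erosion_step(weights, grads, local_idx, decay=0.1):
    """One round of a Weight Erosion style aggregation (sketch).

    weights:   current per-client weights, shape (n_clients,)
    grads:     per-client gradient vectors, shape (n_clients, n_params)
    local_idx: index of the client personalizing its model
    decay:     hypothetical rate converting gradient distance into weight loss
    """
    local_grad = grads[local_idx]
    # Erode each client's weight in proportion to how far its gradient
    # strays from the local client's gradient in this round.
    dists = np.linalg.norm(grads - local_grad, axis=1)
    weights = np.clip(weights - decay * dists, 0.0, 1.0)
    # Aggregate the update as a weighted average of the client gradients.
    agg = weights @ grads / max(weights.sum(), 1e-12)
    return weights, agg
```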
modules
: This folder contains two Python files with functions and classes used in Jax_federated_learning.ipynb.
- Automate hyper-parameter tuning and use cross-validation
- Split the data set in more different ways (see the sketch after this list):
  - IID
  - Vary number of clients
  - By gender
  - By outcome (label skew)
  - $\mathbb{P} \left( \mathbf{x} | y \right)$-skew
  - $\mathbb{P} \left( y | \mathbf{x} \right)$-skew
- Test the gradient-based method on a different task (maybe more parameters and much larger data set, e.g. from medical imaging?)
- Implement other methods (e.g., based on influence)
- Refine gradient-based similarity some more, try it out on other problems
- Rigorously define similarity (e.g., with learning theory notion of discrepancy)
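As an illustration of the splits listed above, here is one way the cleaned Titanic data could be partitioned into simulated clients. This is a sketch: the column names `sex` and `survived` are assumptions about the cleaned data files, not verified against the notebook.

```python
import numpy as np
import pandas as pd

def split_clients(df: pd.DataFrame, by: str = "iid", n_clients: int = 2, seed: int = 0):
    """Split a dataframe into simulated client data sets.

    by: 'iid' (random shards), 'gender' (feature skew), or 'label' (label skew).
    """
    if by == "iid":
        shuffled = df.sample(frac=1.0, random_state=seed)
        return np.array_split(shuffled, n_clients)
    if by == "gender":
        return [group for _, group in df.groupby("sex")]
    if by == "label":
        return [group for _, group in df.groupby("survived")]
    raise ValueError(f"unknown split: {by}")
```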
We consider a network of participants $i = 1, \dots, n$, each holding a local data set drawn from its own underlying distribution $\mathcal{D}_i$.
As an example, the participants could be individual hospitals dispersed across one or several regions. Doctors could have reasons to expect that the same set of symptoms is more strongly associated with one diagnosis for the patients of one hospital, and with another diagnosis for the patients of a different hospital. As a lay example, it seems conceivable that diseases caused by poor sanitation are more likely to occur in poor rural regions, while drug abuse is more frequent in affluent urban settings, even though both cause similar symptoms. (This is a hypothetical illustration, not a claim that these symptoms are in fact similar.) Knowing this, a medical professional from one hospital might want to train a diagnostic model, recognizing that they would benefit from enhancing the limited number of samples collected at their own hospital with samples collected at a selected subset of other hospitals.
Why we investigate methods based on the distance between gradients in SGD-based optimization algorithms:
We say that a distribution $\mathcal{D}_j$ is similar to a distribution $\mathcal{D}_i$ if the true model of each distribution also achieves a low expected loss on the other distribution (cf. the inequality below).
- We might want to define a metric for that notion of similarity $\uparrow$
Recall that, given an inference task defined by:
- a class of models $\mathbb{M}$ and
- a loss function $\mathcal{L} (y, \hat{y})$,
the true model $\mathcal{M}_{i}^{true}$ minimizes the expected loss $\mathcal{R}_{\mathcal{D}_i} \left( \mathcal{M} \right)$ on the distribution $\mathcal{D}_i$:
$ \mathcal{R}_{\mathcal{D}_i} \left( \mathcal{M}_{i}^{true} \right) \triangleq \underset{(\mathbf{x}, y) \sim \mathcal{D}_i}{E} \left[ \mathcal{L} \left( y, \mathcal{M}_{i}^{true} \left( \mathbf{x} \right) \right) \right] \leq \mathcal{R}_{\mathcal{D}_i} \left( \mathcal{M}_{i}^{other} \right) \quad \forall \mathcal{M}_{i}^{other} \in \mathbb{M} $
Let $\mathcal{M}^{tentative} \in \mathbb{M}$ be a model that performs very suboptimally on both $\mathcal{D}_i$ and $\mathcal{D}_j$:
$ \textbf{min} \left( \mathcal{R}_{\mathcal{D}_i} \left( \mathcal{M}^{tentative} \right), \mathcal{R}_{\mathcal{D}_j} \left( \mathcal{M}^{tentative} \right) \right) \gg \textbf{max} \left( \mathcal{R}_{\mathcal{D}_i} \left( \mathcal{M}_j^{true} \right), \mathcal{R}_{\mathcal{D}_j} \left( \mathcal{M}_i^{true} \right) \right) $
Further, let $\mathbf{g}_i$ and $\mathbf{g}_j$ denote the gradients of the loss of $\mathcal{M}^{tentative}$ on samples drawn from $\mathcal{D}_i$ and $\mathcal{D}_j$, respectively. If the two distributions are similar, we expect these gradients to be similar as well.
- We might want to prove that $\uparrow$, for which we would also need to be more rigorous about what we mean by the two gradients being similar.
- We haven't shown that the gradients of two more similar distributions will be more similar on average than the gradients of two less similar distributions.
- Here we assume that $\mathcal{M}^{tentative}$ performs very suboptimally on both distributions. What happens as $\mathcal{M}^{tentative}$ approaches, let's say, the true model of the joint distribution? Clearly, the gradients will begin to diverge. Can we formalize this in some way? In fact, I believe that the true model of the joint distribution is such that the sum $\mathbf{g}_i + \mathbf{g}_j$ adds up to $\mathbf{0}$ in expectation.
- Related to the previous point: maybe the aggregated similarity estimator (cf. below) should experience some form of decay, if we expect the (expected) gradients to diverge as the model gets better? But then, if we do enough SGD steps, all similarities would end up decaying at some point and we would eventually end up with the local model again. So where do we find the compromise here?
- Is there a canonical symbol to denote the true error of a model? That's the correct name for what I'm defining here, right?
- I'm almost certain that true model is not the correct term for the model that minimizes the true error over all models in a class of models. What's the correct term?
- We might want to distinguish between the norm and the angle of the gradients. Though a gradient that points in the right direction but is far too long or far too short is still not useful, and certainly does not hint at similar underlying distributions.
It is therefore our intention to devise an estimator for the (so far vaguely defined) similarity of two distributions, based on the distance between gradients in SGD-based optimization algorithms involving samples taken from these two distributions. Note that similar distributions lead to similar gradients only in expectation: individual minibatch gradients can vary considerably, especially when the minibatch size is small. The estimator must therefore be aggregated over a number of SGD or minibatch steps.
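As a concrete illustration, here is a minimal JAX sketch of such an aggregated estimator. Logistic regression is an assumed stand-in for the model class, cosine similarity stands in for the (still to be defined) gradient distance, and the exponential moving average is an assumption of this sketch, not a settled design choice:

```python
import jax
import jax.numpy as jnp

def loss(params, X, y):
    # Binary cross-entropy of a linear-logistic model
    # (an assumed stand-in for the loss L and model class M).
    logits = X @ params
    return jnp.mean(jnp.logaddexp(0.0, logits) - y * logits)

grad_fn = jax.grad(loss)

def step_similarity(params, batch_i, batch_j):
    """Cosine similarity between the two minibatch gradients g_i and g_j."""
    g_i = grad_fn(params, *batch_i)  # gradient on a minibatch from D_i
    g_j = grad_fn(params, *batch_j)  # gradient on a minibatch from D_j
    return jnp.vdot(g_i, g_j) / (jnp.linalg.norm(g_i) * jnp.linalg.norm(g_j) + 1e-12)

def update_similarity(ema, cos_t, alpha=0.1):
    """Aggregate the noisy per-step signal with an exponential moving average."""
    return cos_t if ema is None else (1.0 - alpha) * ema + alpha * cos_t
```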
The following could be a rough sketch of the algorithm:
1. Perform a certain number of steps with all participants, until we've aggregated a stable enough similarity estimator.
2. Select a subset of the available data sets, potentially weighing the contribution of each data set according to the estimated similarity of its underlying distribution.
3. Train the model on the selected data sets until some stopping criterion (convergence, number of communication rounds, ...) is reached. Step 2 could potentially be repeated at regular intervals throughout step 3.
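Putting the three steps together, the loop might look as follows. This is a sketch, not the notebook's implementation: `sample_batch` and `sgd_step` are hypothetical helpers, the threshold-based selection rule is an assumption, and `grad_fn`, `step_similarity`, and `update_similarity` are taken from the sketch above.

```python
def collaborative_training(params, clients, local_idx,
                           warmup_steps=100, train_steps=1000,
                           threshold=0.5, alpha=0.1):
    n = len(clients)
    sims = [None] * n
    # Step 1: run SGD locally while aggregating a similarity estimate
    # between the local client's gradients and every other participant's.
    for _ in range(warmup_steps):
        batch_local = sample_batch(clients[local_idx])
        for k in range(n):
            cos = step_similarity(params, sample_batch(clients[k]), batch_local)
            sims[k] = update_similarity(sims[k], cos, alpha)
        params = sgd_step(params, grad_fn(params, *batch_local))
    # Step 2: select data sets, weighing each by its estimated similarity.
    weights = [float(s) if s is not None and s >= threshold else 0.0 for s in sims]
    # Step 3: train on the selected data sets until the step budget is spent.
    for _ in range(train_steps):
        grads = [grad_fn(params, *sample_batch(c)) for c in clients]
        agg = sum(w * g for w, g in zip(weights, grads)) / max(sum(weights), 1e-12)
        params = sgd_step(params, agg)
    return params
```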
There is a balance to be struck between the number of communication rounds needed to perform steps 1 & 3 satisfactorily, and the requirement of communication efficiency.
I don't know whether weight quantization and model sparsification are viable strategies to achieve communication efficiency in the training of linear models.