Command to run pipeline:
Various outputs along with the configurations used are present at: "experiment-environment/aggregation_tests/Aggregation_Prv_W_1"
What is differential privacy?
Differential privacy is a mathematical framework for quantifying how much can be learned about any one individual from the results of analyzing a dataset, so that we can give formal privacy guarantees to the individuals whose sensitive data we want to analyze. Differential privacy aims to strike a balance between providing useful information from a dataset and protecting the privacy of the individuals whose data it contains.
How does differential privacy work?
Mathematically speaking, a database is a list of entries where each entry is the set of answers to a collection of questions – like a table where the questions of a survey are the columns and each row is a different person’s survey response. A privacy mechanism is a function that acts on our database and changes some of the entries so that we can no longer identify a particular individual’s survey response. Roughly speaking, a privacy mechanism is a randomized algorithm whose output distribution changes only slightly when a single entry of the database is changed.
What is $\epsilon$?
The Greek letter “epsilon” is our privacy parameter. It tells us how much the output of a query can change when a single entry of the dataset is changed. We want epsilon to be small, so that adding or removing one person’s data does not change the output of a query in a noticeable way.
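For reference, the standard formal definition: a randomized mechanism $M$ is $\epsilon$-differentially private if, for every pair of datasets $D$ and $D'$ that differ in a single entry and every set of possible outputs $S$,

$$\Pr[M(D) \in S] \leq e^{\epsilon} \, \Pr[M(D') \in S].$$

A smaller $\epsilon$ forces these two output distributions to be closer together, which is why small values of $\epsilon$ correspond to stronger privacy.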
What is the sensitivity of a function?
The sensitivity of a function is the maximum difference in its outputs when that function is applied to two datasets that differ in only one entry.
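In symbols, the (global) sensitivity of a function $f$ is

$$\Delta f = \max_{D, D'} \lVert f(D) - f(D') \rVert_1,$$

where the maximum is taken over all pairs of datasets $D$ and $D'$ that differ in a single entry. For example, a counting query has sensitivity 1, because adding or removing one person changes the count by at most 1.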
Noise function (Laplace Mechanism):
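The Laplace mechanism releases a query answer $f(D)$ plus noise drawn from a Laplace distribution with scale $\Delta f / \epsilon$, which suffices for $\epsilon$-differential privacy. Below is a minimal Python sketch of this mechanism using numpy; the function name and example values are illustrative and not necessarily those used in our pipeline.

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng=None):
    """Return an epsilon-differentially-private estimate of true_value.

    Noise is drawn from a Laplace distribution with scale sensitivity / epsilon,
    the standard calibration for the Laplace mechanism.
    """
    if rng is None:
        rng = np.random.default_rng()
    scale = sensitivity / epsilon
    return true_value + rng.laplace(loc=0.0, scale=scale)

# Example: a counting query has sensitivity 1; release it with epsilon = 0.5.
noisy_count = laplace_mechanism(true_value=1234, sensitivity=1.0, epsilon=0.5)
print(noisy_count)
```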
In global differential privacy (GDP), the privacy guarantee is provided for the dataset as a whole, rather than for individual records within the dataset. This means that any analysis or query performed on the dataset should not reveal sensitive information about any individual participant, even when combined with additional knowledge or external datasets. The mechanism of GDP typically involves adding noise to a query result that is computed over the entire dataset. This noise ensures that the results of the analysis are not overly influenced by any individual’s data and that the noisy query result cannot be reverse engineered to expose any individual data point, thereby safeguarding individual privacy while still allowing useful insights to be derived from the data as a whole. Note that since only the query results are differentially private and not the individual data points, the data holder needs to be trusted by the survey participants.
In local differential privacy (LDP), noise is added to individual data points before they are shared or analyzed. Unlike GDP, which adds noise to a global query, LDP injects noise at the source of the data, i.e., on the survey participant's device or at the data collection point, before any data is transmitted or aggregated. This perturbation typically involves adding noise sampled from a known distribution to each data point, making it statistically indistinguishable from similar data points while preserving the overall statistical properties of the dataset. Since the noise is injected before the data reaches the organization that is collecting it, the survey participants do not have to trust that organization.
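As a concrete illustration of local noise injection, the classic randomized response mechanism keeps a respondent's yes/no answer with probability $e^{\epsilon}/(e^{\epsilon}+1)$ and flips it otherwise, before the answer ever leaves their device. The sketch below is illustrative only and is not necessarily the local mechanism used in our pipeline.

```python
import numpy as np

def randomized_response(true_bit, epsilon, rng=None):
    """Locally privatize a single yes/no answer (epsilon-local DP).

    The true answer is kept with probability e^eps / (e^eps + 1) and
    flipped otherwise, so the collector never learns any single
    respondent's answer with certainty.
    """
    if rng is None:
        rng = np.random.default_rng()
    p_keep = np.exp(epsilon) / (np.exp(epsilon) + 1.0)
    return true_bit if rng.random() < p_keep else 1 - true_bit

# Each participant perturbs their own answer before sending it.
epsilon = 1.0
reports = [randomized_response(bit, epsilon) for bit in [1, 0, 1, 1, 0]]

# The collector can still estimate the true count by debiasing the noisy sum.
n = len(reports)
p_keep = np.exp(epsilon) / (np.exp(epsilon) + 1.0)
estimated_count = (sum(reports) - n * (1 - p_keep)) / (2 * p_keep - 1)
print(reports, estimated_count)
```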
In shuffle differential privacy (shuffle DP), privacy guarantees are achieved by shuffling the data before analysis, thereby breaking any direct link between an individual’s data and their contribution to the dataset. In this model, users generate messages by applying a local randomizer to their data, similar to the local model. However, in the shuffle model, users trust a central entity to apply a uniformly random permutation to all the messages generated by users. The process typically involves shuffling the order of data points or perturbing the data in a way that masks the identity of individuals while still allowing for meaningful analysis at the aggregate level. This ensures that any analysis performed on the shuffled dataset does not reveal sensitive information about any specific individual, even when combined with external knowledge or additional datasets.
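A rough sketch of the shuffle model, using randomized response as the local randomizer: each user perturbs their own value, and a trusted shuffler applies a uniformly random permutation before anything reaches the analyzer. The function below is a hypothetical illustration, not our pipeline's implementation.

```python
import numpy as np

def shuffle_model_collect(true_bits, epsilon, rng=None):
    """Shuffle-model sketch: local randomization, then a random permutation.

    The analyzer only ever sees the shuffled multiset of reports, so no
    report can be linked back to the user who produced it.
    """
    if rng is None:
        rng = np.random.default_rng()
    p_keep = np.exp(epsilon) / (np.exp(epsilon) + 1.0)
    # Local step: every user perturbs their own bit (randomized response).
    reports = [b if rng.random() < p_keep else 1 - b for b in true_bits]
    # Shuffler step: a trusted shuffler applies a uniformly random permutation.
    rng.shuffle(reports)
    return reports

shuffled_reports = shuffle_model_collect([1, 0, 1, 1, 0], epsilon=1.0)
print(shuffled_reports)
```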
We utilized Statistics Canada's Canadian Community Health Survey (CCHS) Public Use Microdata File (PUMF) dataset. This is an open dataset with a variety of features pertaining to the health of Canadians. After cleaning, it consists of 113,290 responses with 16 columns. The cleaning process remapped the custom codes used within the columns to a sequential set of categorical values for discrete features and kept continuous values as provided. For example, each response comes from a Canadian province or territory, where the GEO_PRV column contains the value 24 for Quebec and 35 for Ontario. These codes are remapped to a number between 1 and 13 based on the alphabetical ordering of the 13 provinces and territories.
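This kind of remapping can be sketched as follows with pandas; only the two GEO_PRV codes mentioned in the text are included in the example mapping, and the helper name is a hypothetical illustration rather than the actual preprocessing code.

```python
import pandas as pd

# Partial mapping from StatCan GEO_PRV codes to region names. Only the two
# codes mentioned above are listed here; the real table covers all 13 regions.
GEO_PRV_NAMES = {24: "Quebec", 35: "Ontario"}

def remap_geo_prv(df, code_to_name=GEO_PRV_NAMES):
    """Remap GEO_PRV codes to sequential values (1..n), ordered
    alphabetically by region name, as described in the cleaning step."""
    ordered_codes = sorted(code_to_name, key=code_to_name.get)
    code_to_seq = {code: i + 1 for i, code in enumerate(ordered_codes)}
    out = df.copy()
    out["GEO_PRV"] = out["GEO_PRV"].map(code_to_seq)
    return out

df = pd.DataFrame({"GEO_PRV": [35, 24, 35]})
print(remap_geo_prv(df))
```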
The static features of the dataset:
Overview of the preprocessed data:
Example of general simulation pipeline:
Parameters used:
Comparing province-wise mean height:
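As a rough sketch of how such a comparison could be produced with the Laplace mechanism described earlier: the mean of $n$ values bounded in $[a, b]$ has sensitivity $(b - a)/n$, so the per-province noise shrinks as the group size grows. Column names other than GEO_PRV below are placeholders, not the actual configuration used.

```python
import numpy as np
import pandas as pd

def dp_mean_by_province(df, value_col, epsilon, value_range, rng=None):
    """Compare true vs. Laplace-noised per-province means of a bounded column.

    For a group of n values bounded in [a, b], the mean has sensitivity
    (b - a) / n, so the noise scale shrinks as the group gets larger.
    """
    if rng is None:
        rng = np.random.default_rng()
    a, b = value_range
    rows = []
    for prov, grp in df.groupby("GEO_PRV"):
        true_mean = grp[value_col].mean()
        scale = (b - a) / (len(grp) * epsilon)  # sensitivity / epsilon
        rows.append({"GEO_PRV": prov,
                     "true_mean": true_mean,
                     "dp_mean": true_mean + rng.laplace(0.0, scale)})
    return pd.DataFrame(rows)

# Illustrative call; "height_m" is a placeholder for the actual height column.
# comparison = dp_mean_by_province(df, "height_m", epsilon=1.0, value_range=(1.0, 2.2))
```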
Comparing counts of daily smokers in provinces: