
PHACDataHub/statscan-phac-diffpriv-collab


Commands to run the pipeline:

  • python experiment-environment/simulator.py
  • python experiment-environment/simulator.py --config_file {file_path}
  • Various outputs, along with the configurations used, are available at: "experiment-environment/aggregation_tests/Aggregation_Prv_W_1"


    What is differential privacy?

    Differential privacy is a mathematical framework for quantifying and guaranteeing the privacy of the individuals whose sensitive data we want to analyze. It aims to strike a balance between extracting useful information from a dataset and protecting the privacy of the individuals whose data it contains.

    How does differential privacy work?

    Mathematically speaking, a database is a list of entries where each entry is the set of answers to a collection of questions – like a table where the questions of a survey are the columns and each row is a different person’s survey response. A privacy mechanism is a function that acts on our database and changes some of the entries so that we can no longer identify a particular individual’s survey response. Roughly speaking, a privacy mechanism is $\epsilon$-differentially private if, for every pair of databases that differ in only one entry and for every possible output of the mechanism, the ratio of the probabilities of obtaining that output from the first database and from the second database is very, very close to 1. This means that if we ask the first database a question, we can expect an answer very similar to what we would have gotten had we asked the same question of the second database. This is very useful to us if one of our databases is provably private!
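    Formally (this is the standard statement of the definition, not taken from the report): a randomized mechanism $M$ is $\epsilon$-differentially private if, for every pair of databases $D_1$ and $D_2$ that differ in a single entry, and for every set $S$ of possible outputs,

$$\Pr[M(D_1) \in S] \;\le\; e^{\epsilon} \, \Pr[M(D_2) \in S].$$

    When $\epsilon$ is small, $e^{\epsilon} \approx 1 + \epsilon$, so the two probabilities are nearly equal – this is exactly the "ratio close to 1" described above.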

    What is $\epsilon$?

    The Greek letter “epsilon” is our privacy parameter. It tells us how much the output of a query can change when a single entry of the dataset is changed. We want epsilon to be small, so that adding one person’s data to the dataset doesn’t change the output in a noticeable way.

    What is the sensitivity of a function?

    The sensitivity of a function is the maximum difference in outputs when applying that function to two datasets that differ in only one entry.
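    In symbols (the standard definition): for a function $f$ acting on databases,

$$\Delta f \;=\; \max_{D_1, D_2} \lVert f(D_1) - f(D_2) \rVert_1,$$

    where the maximum is taken over all pairs of databases $D_1, D_2$ that differ in one entry. For example, a counting query (“how many respondents are daily smokers?”) has sensitivity 1, since changing one entry changes the count by at most 1.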

    Noise function (Laplace Mechanism): [figure]
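    A minimal sketch of the Laplace mechanism in Python (illustrative, not the repository's implementation): noise drawn from a Laplace distribution with scale $\Delta f / \epsilon$ is added to the true query answer.

```python
import numpy as np

def laplace_mechanism(true_answer, sensitivity, epsilon, rng=None):
    """Return an epsilon-DP answer: true answer + Laplace(0, sensitivity/epsilon) noise."""
    rng = rng or np.random.default_rng()
    return true_answer + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# Example: a counting query has sensitivity 1.
print(laplace_mechanism(true_answer=4217, sensitivity=1.0, epsilon=0.5))
```

    Smaller values of epsilon mean a larger noise scale, i.e., stronger privacy at the cost of less accurate answers.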


    Types of Differential Privacy

  • Global Differential Privacy (GDP)
    In GDP, the privacy guarantee is provided for the dataset as a whole, rather than for individual records within the dataset. This means that any analysis or query performed on the dataset should not reveal sensitive information about any individual participant, even when combined with additional knowledge or external datasets. The mechanism typically involves adding noise to a query result that is computed over the entire dataset. This noise ensures that the results of the analysis are not overly influenced by any individual’s data, and that the noisy query result cannot be reverse-engineered to expose any individual data point, safeguarding individual privacy while still allowing useful insights to be derived from the data as a whole (a sketch contrasting the global and local models follows this list). Note that since only the query is differentially private, and not the individual data points, the data holder needs to be trusted by the survey participants.
  • Local Differential Privacy (LDP)
    In LDP, noise is added to individual data points before they are shared or analyzed. Unlike GDP, which adds noise to a global query, LDP injects noise at the source of the data, i.e., on the survey participant's device or at the data collection point, before any data is transmitted or aggregated. This perturbation typically involves adding noise sampled from a known distribution to each data point, making any individual's report statistically indistinguishable from similar reports while preserving the overall statistical properties of the dataset. Since the noise is injected before the data reaches the collecting organization, the survey participants do not have to trust that organization.
  • Shuffle Differential Privacy (SDP)
    In Shuffle DP, privacy guarantees are achieved by shuffling the data before analysis, thereby breaking any direct link between an individual’s data and their contribution to the dataset. In this model, users generate messages by applying a local randomizer to their data, as in the local model. However, in the shuffle model, users trust a central entity to apply a uniformly random permutation to all the messages generated by users. The process typically involves shuffling the order of data points or perturbing the data in a way that masks the identity of individuals while still allowing for meaningful analysis at the aggregate level. This ensures that any analysis performed on the shuffled dataset does not reveal sensitive information about any specific individual, even when combined with external knowledge or additional datasets.
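    Below is a minimal sketch (not the repository's pipeline) contrasting the global and local models on a simple counting query. The binary randomized-response mechanism used for the local model is a standard LDP technique; the data and parameter values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)
epsilon = 1.0
data = rng.integers(0, 2, size=10_000)  # toy data: 1 = daily smoker, 0 = not

# --- Global model: a trusted curator adds Laplace noise to the aggregate. ---
# A counting query has sensitivity 1, so the noise scale is 1 / epsilon.
gdp_count = data.sum() + rng.laplace(0, 1 / epsilon)

# --- Local model: each respondent randomizes their own bit before sending. ---
# Randomized response: keep the true bit with probability e^eps / (e^eps + 1),
# flip it otherwise; this satisfies epsilon-LDP for each individual.
p_keep = np.exp(epsilon) / (np.exp(epsilon) + 1)
flips = rng.random(data.size) > p_keep
reported = np.where(flips, 1 - data, data)

# Debias the aggregated noisy reports to estimate the true count:
# E[reported_i] = (2*p_keep - 1) * x_i + (1 - p_keep).
ldp_count = (reported.sum() - data.size * (1 - p_keep)) / (2 * p_keep - 1)

print(f"true: {data.sum()}, global DP: {gdp_count:.0f}, local DP: {ldp_count:.0f}")
```

    Note how the local estimate is noisier than the global one at the same epsilon: each report is perturbed individually, which is the price of not having to trust the data collector.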

    Data Used

    We utilized Statistics Canada’s Canadian Community Health Survey (CCHS) Public Use Microdata File (PUMF) dataset. This is an open dataset with a variety of features pertaining to the health of Canadians. After cleaning, it consists of 113,290 responses across 16 columns. The cleaning process remapped the custom codes used within the columns to sequential categorical values for discrete features, and kept continuous values as provided. For example, each response comes from a Canadian province or territory, where the GEO_PRV column contains the value 24 for Quebec and 35 for Ontario. These get remapped to a number between 1 and 13 based on the alphabetical ordering of the 13 provinces and territories.
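    A hedged sketch of this remapping step (only the Quebec and Ontario codes are confirmed above; the remaining codes are assumptions based on Statistics Canada's standard province/territory codes, and the variable names are illustrative):

```python
import pandas as pd

# GEO_PRV codes in the CCHS PUMF (24 = Quebec, 35 = Ontario per the text;
# the rest are assumed from StatCan's standard geography codes).
GEO_PRV_NAMES = {
    10: "Newfoundland and Labrador", 11: "Prince Edward Island",
    12: "Nova Scotia", 13: "New Brunswick", 24: "Quebec", 35: "Ontario",
    46: "Manitoba", 47: "Saskatchewan", 48: "Alberta",
    59: "British Columbia", 60: "Yukon", 61: "Northwest Territories",
    62: "Nunavut",
}

# Remap each code to its 1-13 rank under alphabetical ordering of the names.
alphabetical = sorted(GEO_PRV_NAMES, key=GEO_PRV_NAMES.get)
code_to_rank = {code: rank for rank, code in enumerate(alphabetical, start=1)}

df = pd.DataFrame({"GEO_PRV": [24, 35, 59]})
df["GEO_PRV"] = df["GEO_PRV"].map(code_to_rank)
print(df)  # Quebec -> 11, Ontario -> 9, British Columbia -> 2
```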

    The static features of the dataset: [figure]

    Overview of the preprocessed data: [figure]

    Example of general simulation pipeline: [figure]


    Experiment Phase

    Parameters used: [figure]

    Comparing province-wise mean height: [figure]

    Comparing counts of daily smokers in provinces: [figure]
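    A minimal sketch of how such a province-wise comparison can be produced under global DP (the column names, clipping bounds, and privacy budget here are illustrative assumptions, not the pipeline's actual parameters):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy stand-in for the cleaned CCHS data; the real pipeline loads the
# preprocessed PUMF. Column names are assumptions.
df = pd.DataFrame({
    "GEO_PRV": rng.integers(1, 14, size=5_000),                  # remapped 1-13
    "height_m": rng.normal(1.70, 0.10, size=5_000).clip(1.2, 2.2),
})

def dp_mean(values, epsilon, lo=1.2, hi=2.2):
    """Global-DP mean via the Laplace mechanism.

    With values clipped to [lo, hi] and one record allowed to change,
    the mean of n values has sensitivity (hi - lo) / n.
    """
    sensitivity = (hi - lo) / len(values)
    return values.mean() + rng.laplace(0, sensitivity / epsilon)

for prov, grp in df.groupby("GEO_PRV"):
    true = grp["height_m"].mean()
    noisy = dp_mean(grp["height_m"], epsilon=1.0)
    print(f"province {prov:2d}: true {true:.3f} m, DP {noisy:.3f} m")
```

    Because the sensitivity of the mean shrinks as the group grows, provinces with more respondents get more accurate noisy estimates at the same epsilon.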

    For in-depth details, check out the document: "PHAC - StatCan DP Research Report.pdf"

    About

    StatsCan-PHAC collaboration on differential privacy
