Commit d23e044

Merge pull request #192 from SundareshSankaran/192-add-privacy-risk-measure-for-smote
Update Custom Step : Add privacy risk measure for SMOTE (SDG)
2 parents 48ca016 + 5d9be5d commit d23e044

3 files changed: +363 -10 lines changed

Diff for: SDG - Generate Synthetic Data through SMOTE/README.md

+52
@@ -39,6 +39,20 @@ This video (click on below image to play) provides a basic idea:
 3. [hnswlib](https://pypi.org/project/hnswlib/)
 4. [protobuf](https://pypi.org/project/protobuf/)

+### (OPTIONAL) Prerequisites for Singling Out Risk calculation
+
+If you want to measure singling out risk (provided as an option in this step), note the following additional prerequisites:
+
+1. The SAS compute session should be configured to access a Python runtime with a version > 3.7 and < 3.12.
+
+2. A Python package, [anonymeter](https://pypi.org/project/anonymeter/), should be installed in the above runtime. Details about anonymeter are available at https://pypi.org/project/anonymeter/
+
+3. As a further dependency, anonymeter requires NumPy at version 1.22 or later and below 1.27 (specifically, `"numpy >=1.22, <1.27"`, a range limited by Numba support).
+
+Note the terms of the anonymeter license here: https://github.com/statice/anonymeter/blob/main/LICENSE.md
+
+Note the citation in the [Privacy Risk](#privacy-risk) section below.
+
 -----
 ## Parameters
 ----
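
The optional prerequisites added in the hunk above (a Python runtime newer than 3.7 and older than 3.12, an installed anonymeter package, and NumPy in the 1.22 to 1.27 range) could be verified before enabling the singling out option. The following is a minimal sketch, not part of the custom step itself: the helper names `parse_version`, `in_range`, and `check_prerequisites` are hypothetical, and it assumes that "> 3.7" means Python 3.8 or later.

```python
import re
import sys
from importlib import metadata


def parse_version(text):
    """Return the leading (major, minor) pair of a version string such as '1.26.4'."""
    match = re.match(r"(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"unrecognised version string: {text!r}")
    return int(match.group(1)), int(match.group(2))


def in_range(version, low, high):
    """True when low <= version < high, comparing (major, minor) tuples."""
    return low <= version < high


def check_prerequisites():
    """Collect human-readable problems; an empty list means every check passed."""
    problems = []
    # Assumption: "> 3.7 and < 3.12" is read as 3.8 <= version < 3.12.
    if not in_range(tuple(sys.version_info[:2]), (3, 8), (3, 12)):
        problems.append("Python must be newer than 3.7 and older than 3.12")
    try:
        metadata.version("anonymeter")
    except metadata.PackageNotFoundError:
        problems.append("anonymeter is not installed")
    try:
        numpy_version = parse_version(metadata.version("numpy"))
        if not in_range(numpy_version, (1, 22), (1, 27)):
            problems.append("NumPy must be >=1.22 and <1.27")
    except metadata.PackageNotFoundError:
        problems.append("NumPy is not installed")
    return problems


if __name__ == "__main__":
    found = check_prerequisites()
    for line in found or ["Singling out risk prerequisites look satisfied"]:
        print(line)
```

Running the script in the Python runtime that the SAS compute session points to prints one line per unmet prerequisite, or a single confirmation line when nothing is missing.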
@@ -55,6 +69,40 @@ This video (click on below image to play) provides a basic idea:
 5. Select a class column (column selector, optional): select a column if you wish to use SMOTE in order to balance or augment a level within the class column. Be judicious in the choice of this column since a column with a high number of levels may slow down or even fail the process. Your class column is required to be in the inputs column list.

 6. Class to augment (drop-down list, values from class column if selected): select the level of the class variable you wish to augment. The values that appear here depend on the data that's contained in the class column, so may take time to populate based on actual data and number of levels.
+----
+### Privacy Risk
+Synthetic data requires assurances on data privacy. One aspect of privacy risk is singling out risk, the risk that an attacker can isolate records belonging to a single individual, a notion that emerged alongside the General Data Protection Regulation (GDPR). **This is an optional step.** If you wish to measure singling out risk, enter the parameters below.
+
+1. **Measure Singling Out Risk** (check box, default not checked): select this option if you want to measure singling out risk. Be aware of the Python dependencies (see the Prerequisites section) and that this measurement adds runtime on top of the generation operation.
+
+2. **Evaluation mode** (drop-down list): select either univariate or multivariate to define the type of attack query to be tested.
+
+3. **Confidence interval** (percentage, numeric stepper): select a number from 90 to 99 to define the confidence level used for the privacy risk estimates.
+
+4. **Number of attacks** (numeric stepper, default 100): enter the number of attacks (queries) to simulate.
+
+5. **Singling Out Risk Results table** (output port): attach a CAS table to the so_results_tbl output port to hold the risk results.
+
+6. **Singling Out Risk Queries table** (output port): attach a CAS table to the so_queries_tbl output port to hold the attack queries.
+
+#### Citation for anonymeter
+
+As we make use of an open-source package, anonymeter, to perform these calculations, we note the following citation:
+
+"A Unified Framework for Quantifying Privacy Risk in Synthetic Data", M. Giomi et al., PoPETS 2023.
+
+This BibTeX entry refers to the paper:
+
+```
+@misc{anonymeter,
+doi = {https://doi.org/10.56553/popets-2023-0055},
+url = {https://petsymposium.org/popets/2023/popets-2023-0055.php},
+journal = {Proceedings of Privacy Enhancing Technologies Symposium},
+year = {2023},
+author = {Giomi, Matteo and Boenisch, Franziska and Wehmeyer, Christoph and Tasnádi, Borbála},
+title = {A Unified Framework for Quantifying Privacy Risk in Synthetic Data},
+}
+```


 ----
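
The Privacy Risk parameters added in the hunk above (evaluation mode, confidence percentage, number of attacks) correspond closely to arguments of anonymeter's `SinglingOutEvaluator`. The sketch below shows one plausible way they might be passed through. It is an illustration under assumptions, not the code shipped in the `.step` file: the helper names `to_confidence_level` and `measure_singling_out_risk` are hypothetical, the inputs are assumed to be pandas DataFrames holding the original and synthetic rows, and the anonymeter calls follow that package's published README as best understood here.

```python
def to_confidence_level(percent):
    """Convert a percentage such as 95 to the fraction 0.95, enforcing the 90-99 range."""
    if not 90 <= percent <= 99:
        raise ValueError("confidence percentage must be between 90 and 99")
    return percent / 100


def measure_singling_out_risk(original, synthetic, mode="univariate",
                              confidence_percent=95, n_attacks=100):
    """Hypothetical helper: run a singling out evaluation on two pandas DataFrames.

    Returns the risk estimate and the list of attack queries that were generated.
    """
    # Imported lazily so the conversion helper works without anonymeter installed.
    from anonymeter.evaluators import SinglingOutEvaluator

    evaluator = SinglingOutEvaluator(ori=original, syn=synthetic, n_attacks=n_attacks)
    evaluator.evaluate(mode=mode)  # "univariate" or "multivariate"
    risk = evaluator.risk(confidence_level=to_confidence_level(confidence_percent))
    return risk, evaluator.queries()
```

In this sketch the two return values play the roles of the so_results_tbl and so_queries_tbl outputs; how the step actually writes them to CAS tables is not shown in this diff.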
@@ -118,6 +166,7 @@ IMPORTANT: Be aware that disabling this step means that none of its main executi

 3. PyPI page for [hnswlib](https://pypi.org/project/hnswlib/)
 4. PyPI page for [protobuf](https://pypi.org/project/protobuf/)
+5. PyPI page for [anonymeter](https://pypi.org/project/anonymeter/)

 ----
 ## SAS Program
@@ -133,6 +182,7 @@ Refer [here](./extras/SDG_SMOTE_Synthetic_Data.sas) for the SAS program used by
 ## Created/contact:

 - Sundaresh Sankaran ([email protected])
+- Josiah Chua ([email protected])

 Acknowledgements to others for their help on details, testing or exploring the area:
 - David Olaleye ([email protected])
@@ -143,6 +193,8 @@ Acknowledgements to others for their help on details, testing or exploring the a
 ----
 ## Change Log

+* Version 1.3.1 (10DEC2024)
+  * Add calculation for privacy risk (singling out risk)
 * Version 1.2 (11NOV2024)
   * Add provenance flag and sampling for assessment
 * Version 1.1 (02NOV2024)

Diff for: SDG - Generate Synthetic Data through SMOTE/SDG - Generate Synthetic Data through SMOTE.step

+1 -1
Large diffs are not rendered by default.
