SDG - Generate Synthetic Data through SMOTE/README.md (+52)
@@ -39,6 +39,20 @@ This video (click on below image to play) provides a basic idea:
3. [hnswlib](https://pypi.org/project/hnswlib/)
4. [protobuf](https://pypi.org/project/protobuf/)

### (OPTIONAL) Prerequisites for Singling Out Risk calculation

If you want to measure singling out risk (provided as an option in this step), note the following additional prerequisites:

1. The SAS compute session should be configured to access a Python runtime with a version greater than 3.7 and less than 3.12.

2. A Python package, [anonymeter](https://pypi.org/project/anonymeter/), should be installed in that runtime. See the package page for details.

3. As a further dependency, anonymeter requires NumPy version 1.22 or later and earlier than 1.27 (specifically, "numpy >=1.22, <1.27", a range limited by Numba support).

Note the terms of the anonymeter license here: https://github.com/statice/anonymeter/blob/main/LICENSE.md

Note the citation in the [Privacy Risk](#privacy-risk) section below.
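
The version ranges above can be checked from inside the Python runtime before enabling the option. The following is a minimal sketch, not part of this step: the helper names (`parse_version`, `python_ok`, `numpy_ok`, `installed_version`, `check_environment`) are hypothetical, and the Python range is interpreted at major.minor granularity (3.8 through 3.11), which is an assumption about how "> 3.7 and < 3.12" is meant.

```python
import sys
from importlib import metadata


def parse_version(text):
    """Return the leading numeric components of a version string as ints."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def python_ok(version_info=sys.version_info):
    # Assumption: compare on (major, minor) only, so 3.8 to 3.11 pass.
    return (3, 7) < tuple(version_info[:2]) < (3, 12)


def numpy_ok(version_text):
    # Stated range: numpy >=1.22, <1.27
    return (1, 22) <= parse_version(version_text)[:2] < (1, 27)


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_environment():
    """Collect a small report of the prerequisites listed above."""
    numpy_version = installed_version("numpy")
    return {
        "python": python_ok(),
        "anonymeter_version": installed_version("anonymeter"),
        "numpy_version": numpy_version,
        "numpy": numpy_version is not None and numpy_ok(numpy_version),
    }


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```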

-----
## Parameters
----
@@ -55,6 +69,40 @@ This video (click on below image to play) provides a basic idea:
5. Select a class column (column selector, optional): select a column if you wish to use SMOTE to balance or augment a level within the class column. Choose this column judiciously, since a column with a large number of levels may slow down the process or even cause it to fail. The class column must also be included in the input columns list.

6. Class to augment (drop-down list, values from class column if selected): select the level of the class variable you wish to augment. The values shown here are read from the data in the class column, so the list may take time to populate depending on the data volume and the number of levels.
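
Before choosing a class column, it can help to look at how many levels it has and how often each occurs. The sketch below does this in plain Python outside SAS; the sample rows, the column names, and the helpers `class_level_counts` and `minority_level` are hypothetical illustrations and not part of this step. A column with few levels and one clearly under-represented level is the typical candidate for augmentation.

```python
from collections import Counter


def class_level_counts(rows, class_column):
    """Count how many rows fall into each level of the class column."""
    return Counter(row[class_column] for row in rows)


def minority_level(counts):
    """Return the least frequent level (ties go to the level seen first)."""
    return min(counts, key=counts.get)


# Hypothetical input rows; "default" plays the role of the class column.
rows = [
    {"income": 52, "default": "no"},
    {"income": 61, "default": "no"},
    {"income": 38, "default": "yes"},
    {"income": 47, "default": "no"},
    {"income": 55, "default": "no"},
]

counts = class_level_counts(rows, "default")
print("number of levels:", len(counts))
print("counts per level:", dict(counts))
print("candidate class to augment:", minority_level(counts))
```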
----
### Privacy Risk
Synthetic data requires assurances on data privacy. One aspect of privacy risk is singling out risk, a concept that evolved alongside the General Data Protection Regulation (GDPR). **This is an optional step.** If you wish to measure singling out risk, enter the parameters below.

1. **Measure Singling Out Risk** (check box, unchecked by default): select this option if you want to measure singling out risk. Be aware of the Python dependencies (see the Prerequisites section) and that this measurement adds runtime on top of the generation operation.

2. **Evaluation mode** (drop-down list): select either univariate or multivariate to define the type of attack query to be tested.

3. **Confidence interval** (percentage, numeric stepper): select a number from 90 to 99 to set the confidence level used for the privacy risk estimates.

4. **Number of attacks** (numeric stepper, default 100): enter the number of attacks (queries) to simulate.

5. **Singling Out Risk Results table** (output port): attach a CAS table to the so_results_tbl output port to hold the results.

6. **Singling Out Risk Queries table** (output port): attach a CAS table to the so_queries_tbl output port to hold the queries.
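
To make the parameters above more concrete, here is a minimal plain-Python sketch of the underlying idea: a query "singles out" an individual when exactly one record matches it, a univariate query tests one column while a multivariate query combines several, and the share of successful attacks is reported with a confidence interval. The records, the queries, and the helpers `singles_out` and `wilson_interval` are hypothetical. The Wilson score interval is used here as an assumed, commonly used choice for a binomial rate; this sketch does not reproduce anonymeter's actual estimator, which is more involved.

```python
from statistics import NormalDist


def singles_out(records, predicate):
    """A query singles out an individual if exactly one record matches it."""
    return sum(1 for record in records if predicate(record)) == 1


def wilson_interval(successes, n_attacks, confidence=0.95):
    """Wilson score interval for the success rate of n_attacks attacks."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p = successes / n_attacks
    denom = 1 + z**2 / n_attacks
    centre = (p + z**2 / (2 * n_attacks)) / denom
    spread = (p * (1 - p) / n_attacks + z**2 / (4 * n_attacks**2)) ** 0.5
    half = z * spread / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# Hypothetical records standing in for the original data.
original = [
    {"age": 34, "zip": "27513"},
    {"age": 34, "zip": "27601"},
    {"age": 91, "zip": "27513"},
]

# One univariate query, one multivariate query, and one that is too broad.
univariate = lambda r: r["age"] == 91
multivariate = lambda r: r["age"] == 34 and r["zip"] == "27601"
too_broad = lambda r: r["age"] == 34

queries = [univariate, multivariate, too_broad]
successes = sum(singles_out(original, q) for q in queries)

# A confidence level of 95 in the step's UI corresponds to 0.95 here.
low, high = wilson_interval(successes, len(queries), confidence=0.95)
print(f"{successes} of {len(queries)} queries single out a record")
print(f"success rate interval: [{low:.3f}, {high:.3f}]")
```

With only three queries the interval is very wide, which illustrates why a larger number of attacks (the default is 100) gives a more useful estimate at the cost of a longer runtime.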

#### Citation for anonymeter

As we make use of an open-source package, anonymeter, to perform these calculations, we note the following citation:

"A Unified Framework for Quantifying Privacy Risk in Synthetic Data", M. Giomi et al, PoPETS 2023.

This bibtex entry refers to the paper:

```
@misc{anonymeter,
doi = {https://doi.org/10.56553/popets-2023-0055},