CuTS: Customizable Tabular Synthetic Data Generation

This is the official codebase of the ICML 2024 paper CuTS: Customizable Tabular Synthetic Data Generation. If you have any questions, feel free to raise an issue on GitHub or contact the corresponding author by email: Mark Vero, [email protected].

Installation

For ease of installation, we provide a conda environment in environment.yml, which can be installed with the following command:

conda env create -f environment.yml

The environment can be activated using the command:

conda activate cuts

Example usage of CuTS

We provide a minimal example of synthesizing data with CuTS, which can be executed by running the provided example.py script.

from customizable_synthesizer import CuTS


program = '''
SYNTHESIZE: Adult;

    ENFORCE: STATISTICAL:  
        E[age|age > 30] == 40;
    
END;
'''    

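# Create the synthesizer from the program and fit it, printing progress
cuts = CuTS(program)
cuts.fit(verbose=True)

# Generate 30000 synthetic data points
syndata = cuts.generate_data(30000)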

Reproducing the experiments in the paper

Datasets

All raw data files are included in the repository, except for those of the Health Heritage Prize dataset, which exceed GitHub's file size limit. The required raw data for the Health Heritage dataset can be downloaded from here. Please download the files and place them under the path tabular_datasets/Health_Heritage.

To fix the train-test split of the datasets, run

python split_datasets.py --dataset <dataset_name>

for each dataset, keeping the other parameters fixed. For the Health Heritage Prize and Compas datasets, additionally run the above command with the flag --binary_fairness.
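The per-dataset invocations described above can be scripted. The following is a minimal sketch and not part of the repository: the helper names build_split_commands and run_all are hypothetical, and the dataset names in the demo are placeholders. The names actually accepted by --dataset should be taken from split_datasets.py.

```python
import subprocess
import sys


def build_split_commands(datasets, fairness_datasets=()):
    """Return one argv list per split_datasets.py invocation.

    Every dataset gets a plain run; datasets listed in
    `fairness_datasets` get an additional run with --binary_fairness.
    """
    commands = []
    for name in datasets:
        base = [sys.executable, "split_datasets.py", "--dataset", name]
        commands.append(base)
        if name in fairness_datasets:
            commands.append(base + ["--binary_fairness"])
    return commands


def run_all(commands):
    """Run the commands in order, stopping at the first failure."""
    for argv in commands:
        subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Placeholder names (assumption): replace with the names the script accepts.
    demo = build_split_commands(
        ["<dataset_a>", "<dataset_b>"], fairness_datasets=["<dataset_b>"]
    )
    # Only print the commands here; call run_all(demo) from the repository root
    # to actually execute them.
    for argv in demo:
        print(" ".join(argv[1:]))
```

Building the commands separately from running them makes it easy to review what will be executed before any split is written.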

Experiments

All single-constraint experiments can be reproduced by running the Python script run_constraint_program_benchmark.py, specifying the arguments --dataset (see the list of dataset names in the script), --workload (default for non-DP and all_three for DP experiments), and --epsilon (default for non-DP and 1.0 for DP experiments).

All constraint-combination experiments can be reproduced by running the Python script run_ablation.py, specifying the arguments --dataset and --option (1 for mixed constraints, 2 for only logical constraints).

Each of the above setups can also be run with the flag --baseline_mode, in which case the raw, unconstrained model is benchmarked.
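To keep the flag combinations straight, the command lines for the two scripts can be assembled programmatically. This is a hedged sketch: the helper names benchmark_command and ablation_command are hypothetical, and it assumes that "default" means the flag may simply be omitted for non-DP runs so that the script's own defaults apply. Check the argument parsers in the scripts before relying on this.

```python
import shlex


def benchmark_command(dataset, dp=False, baseline=False):
    """Command line for a single-constraint benchmark run."""
    argv = ["python", "run_constraint_program_benchmark.py", "--dataset", dataset]
    if dp:
        # DP settings as stated in the README: workload all_three, epsilon 1.0.
        argv += ["--workload", "all_three", "--epsilon", "1.0"]
    # Assumption: for non-DP runs the flags are omitted and defaults are used.
    if baseline:
        argv.append("--baseline_mode")
    return shlex.join(argv)


def ablation_command(dataset, option, baseline=False):
    """Command line for a constraint-combination run.

    option 1: mixed constraints, option 2: only logical constraints.
    """
    if option not in (1, 2):
        raise ValueError("option must be 1 or 2")
    argv = ["python", "run_ablation.py", "--dataset", dataset, "--option", str(option)]
    if baseline:
        argv.append("--baseline_mode")
    return shlex.join(argv)


if __name__ == "__main__":
    # "<dataset_name>" is a placeholder; see the scripts for valid names.
    print(benchmark_command("<dataset_name>", dp=True))
    print(ablation_command("<dataset_name>", option=1, baseline=True))
```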

Note that the backbone model is trained on the first execution at a specific privacy level on a given dataset, which may take up to a few hours. The backbone is then saved, and any later experiment on the same dataset and privacy level loads and fine-tunes this model instead of retraining it, saving considerable computation time. Slight random differences in the backbone model may cause your results to deviate insignificantly from those presented in the paper.

[17.03.2025] Update: The backbones used in the paper are now available for download here. Please place the whole folder under experiment_data.

All experimental results are saved in .npy files in the folder experiment_data and include six metrics (three statistical similarity metrics and three downstream classifier performance metrics). The layout of the .npy tensor can be inspected in the corresponding experimental scripts. To reproduce the results shown in the paper, average over the sample and resample dimensions of the mean results for each metric.
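The averaging step can be sketched with NumPy as follows. The axis layout used here, (samples, resamples, metrics), is purely hypothetical and chosen for illustration; the real layout must be read from the corresponding experimental script, and the function name average_over_runs is likewise an assumption. A synthetic array stands in for a result file, which would normally be read with np.load.

```python
import numpy as np


def average_over_runs(mean_results, sample_axis, resample_axis):
    """Average an array of per-run mean results over the two run axes."""
    return mean_results.mean(axis=(sample_axis, resample_axis))


# Synthetic stand-in with a HYPOTHETICAL layout (samples, resamples, metrics).
# A real tensor would come from np.load on a file in experiment_data.
demo = np.arange(2 * 3 * 6, dtype=float).reshape(2, 3, 6)

# One averaged value per metric remains after reducing both run axes.
per_metric = average_over_runs(demo, sample_axis=0, resample_axis=1)
print(per_metric.shape)
```

Passing the axes explicitly keeps the helper independent of the actual tensor layout, so only the two axis indices need to change once the layout is known.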

Citation

@inproceedings{vero2024cuts,
    title={Cu{TS}: Customizable Tabular Synthetic Data Generation},
    author={Mark Vero and Mislav Balunovic and Martin Vechev},
    booktitle={Forty-first International Conference on Machine Learning},
    year={2024}
}
