📃 Paper • Data (Yelp/OpenReview/PubMed) • Project Page
This repository implements the Augmented Private Evolution (Aug-PE) algorithm, leveraging inference API access to large language models (LLMs) to generate differentially private (DP) synthetic text without the need for model training. We compare DP-SGD finetuning and Aug-PE:
Under
- 03/13/2024: Project page is available, which outlines the algorithm and its results.
- 03/11/2024: Code and ArXiv paper are available.
conda env create -f environment.yml
conda activate augpe
Datasets are located at data/{dataset}, where dataset is yelp, openreview, or pubmed.
Download the Yelp train.csv (1.21G) and PubMed train.csv (117MB) from this link or execute:
bash scripts/download_data.sh # download yelp train.csv and pubmed train.csv
Dataset description:
- Yelp: Processed Yelp dataset from (Yue et al. 2023) with 1.9M reviews for training, 5000 for validation, and 5000 for testing.
- OpenReview: Crawled and processed ICLR 2023 reviews from OpenReview website, with 8396 reviews for training, 2798 for validation, and 2798 for testing.
- PubMed: Abstracts of medical papers in PubMed from 2023/08/01 to 2023/08/07 crawled by (Yu et al. 2023), with 75316 abstracts for training, 14423 for validation, and 4453 for testing.
Pre-compute embeddings for private data (line 1 in Aug-PE algorithm):
bash scripts/embeddings.sh --openreview # Compute private embeddings
bash scripts/embeddings.sh --pubmed
bash scripts/embeddings.sh --yelp
Note: Computing embeddings for OpenReview and PubMed is relatively quick. However, due to Yelp's large dataset size (1.9M training samples), the process may take approximately 40 minutes.
Calculate the DP noise level for your dataset in notebook/dp_budget.ipynb given the privacy budget.
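To give a feel for what such a calculation involves, here is a minimal sketch that is not taken from the notebook. It assumes the only private operation is a Gaussian mechanism applied once per epoch to a histogram with L2 sensitivity 1, and it uses the exact (epsilon, delta) curve of the Gaussian mechanism together with the fact that T Gaussian mechanisms compose like a single one with noise scaled by 1/sqrt(T). The function names and the default sensitivity are illustrative assumptions; the notebook's accounting may differ.

```python
import math


def _phi(x):
    """Standard normal CDF, written with erfc for numerical stability."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def gaussian_delta(epsilon, mu):
    """Exact delta of a Gaussian mechanism whose sensitivity/std ratio is mu."""
    return _phi(-epsilon / mu + mu / 2.0) - math.exp(epsilon) * _phi(-epsilon / mu - mu / 2.0)


def noise_std(epsilon, delta, num_iterations, sensitivity=1.0):
    """Smallest per-iteration noise std meeting (epsilon, delta) overall.

    Assumption: num_iterations Gaussian mechanisms, each with the given
    L2 sensitivity, which compose to mu_total = sqrt(T) * sensitivity / std.
    """
    lo, hi = 1e-6, 100.0  # bracket for mu_total; delta grows with mu
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if gaussian_delta(epsilon, mid) > delta:
            hi = mid
        else:
            lo = mid
    return sensitivity * math.sqrt(num_iterations) / lo


if __name__ == "__main__":
    for eps in (1.0, 2.0, 4.0):
        print(eps, noise_std(eps, delta=1e-5, num_iterations=10))
```

A larger privacy budget yields a smaller noise level, and running more epochs under the same budget requires more noise per epoch.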
For visualization with Wandb, configure the --wandb_key and --project with your key and project name in dpsda/arg_utils.py.
Utilize open-source LLMs from Hugging Face to generate synthetic data:
export CUDA_VISIBLE_DEVICES=0
bash scripts/hf/{dataset}/generate.sh # Replace `{dataset}` with yelp, openreview, or pubmed
Some key hyperparameters:
- noise: DP noise.
- epoch: we use 10 epochs for DP setting. For the non-DP setting, we use 20 epochs for Yelp and 10 epochs for other datasets.
- model_type: model on huggingface, such as ["gpt2", "gpt2-medium", "gpt2-large", "meta-llama/Llama-2-7b-chat-hf", "tiiuae/falcon-7b-instruct", "facebook/opt-6.7b", "lmsys/vicuna-7b-v1.5", "mistralai/Mixtral-8x7B-Instruct-v0.1"].
- num_seed_samples: number of synthetic samples.
- lookahead_degree: number of variations for synthetic sample embedding estimation (line 5 in Aug-PE algorithm). Default is 0 (self-embedding).
- L: related to the number of variations to generate candidate synthetic samples (line 18 in Aug-PE algorithm).
- feat_ext: embedding model on huggingface sentence-transformers.
- select_syn_mode: select synthetic samples according to histogram votes or probability. Default is rank (line 19 in Aug-PE algorithm).
- temperature: temperature for LLM generation.
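The way noise and select_syn_mode interact can be illustrated with a small, self-contained sketch. It is not the repository's implementation: it assumes each private embedding casts one vote for its nearest synthetic embedding, Gaussian noise with standard deviation noise is added to the vote counts, and the rank mode keeps the candidates with the highest noisy votes. All function names below are hypothetical.

```python
import random


def nn_histogram(private_emb, synthetic_emb):
    """Count, for each synthetic sample, how many private samples are nearest to it."""
    counts = [0] * len(synthetic_emb)
    for p in private_emb:
        nearest = min(
            range(len(synthetic_emb)),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(p, synthetic_emb[j])),
        )
        counts[nearest] += 1
    return counts


def noisy_histogram(counts, noise, rng):
    """Add independent Gaussian noise (std = noise) to every vote count."""
    return [c + rng.gauss(0.0, noise) for c in counts]


def select_by_rank(votes, k):
    """Return the indices of the k synthetic samples with the highest votes."""
    return sorted(range(len(votes)), key=lambda j: votes[j], reverse=True)[:k]


if __name__ == "__main__":
    rng = random.Random(0)
    private = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)]
    synthetic = [(0.0, 0.0), (5.0, 5.0), (10.0, 10.0)]
    votes = noisy_histogram(nn_histogram(private, synthetic), noise=1.0, rng=rng)
    print(select_by_rank(votes, k=2))
```

Only the noisy histogram touches the private data in this sketch, which is why the noise level is the quantity tied to the privacy budget.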
Finetune the downstream model with DP synthetic text and evaluate the model's accuracy on real test data:
bash scripts/hf/{dataset}/downstream.sh # Finetune downstream model and evaluate performance
Measure the embedding distribution distance:
bash scripts/hf/{dataset}/metric.sh # Calculate distribution distance
For a streamlined process that combines all generation and evaluation steps:
bash scripts/hf/template/{dataset}.sh # Complete workflow for each dataset
We access closed-source models through the Azure OpenAI API. Set your key and endpoint in apis/azure_api.py:
MODEL_CONFIG = {
    'gpt-3.5-turbo': {
        "openai_api_key": "YOUR_AZURE_OPENAI_API_KEY",
        "openai_api_base": "YOUR_AZURE_OPENAI_ENDPOINT",
        "engine": 'YOUR_DEPLOYMENT_NAME',
    },
}
Here engine could be gpt-35-turbo in Azure.
Run the following script to generate synthetic data, evaluate it on the downstream task, and calculate the embedding distribution distance between real and synthetic data:
bash scripts/gpt-3.5-turbo/{dataset}.sh
We use text-length related prompts for GPT-3.5 to control the length of the generated text. We introduce several additional hyperparameters here:
- dynamic_len: is used to enable the dynamic length mechanism.
- word_var_scale: Gaussian noise variance used to determine targeted_word.
- max_token_word_scale: max number of tokens per word. We set the max_token for LLM generation based on the targeted_word (specified in the prompt) and max_token_word_scale.
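As a rough illustration of how these three settings could fit together, the sketch below perturbs a base word count with Gaussian noise to obtain targeted_word and then derives max_token from it. The exact formula is an assumption: word_var_scale is treated as a variance, as described above, and the function name, the base_words argument, and the lower bound of one word are hypothetical. The repository's logic may differ.

```python
import math
import random


def target_length(base_words, word_var_scale, max_token_word_scale, rng, min_words=1):
    """Return (targeted_word, max_token) for one generation request.

    Assumptions: word_var_scale is the variance of zero-mean Gaussian noise
    added to base_words, and max_token scales linearly with targeted_word.
    """
    std = math.sqrt(word_var_scale)
    targeted_word = max(min_words, round(base_words + rng.gauss(0.0, std)))
    max_token = math.ceil(targeted_word * max_token_word_scale)
    return targeted_word, max_token


if __name__ == "__main__":
    rng = random.Random(0)
    targeted_word, max_token = target_length(
        base_words=80, word_var_scale=100.0, max_token_word_scale=1.5, rng=rng
    )
    # targeted_word would be written into the prompt; max_token caps the API call.
    print(targeted_word, max_token)
```

Tying max_token to targeted_word keeps the generation cap proportional to the length requested in the prompt rather than fixed for every sample.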
Use the notebook to calculate the text length distribution difference between real and synthetic data: notebook/text_lens_distribution.ipynb
If you find our work helpful, please consider citing it as follows:
@inproceedings{
xie2024differentially,
title={Differentially Private Synthetic Data via Foundation Model {API}s 2: Text},
author={Chulin Xie and Zinan Lin and Arturs Backurs and Sivakanth Gopi and Da Yu and Huseyin A Inan and Harsha Nori and Haotian Jiang and Huishuai Zhang and Yin Tat Lee and Bo Li and Sergey Yekhanin},
booktitle={Forty-first International Conference on Machine Learning},
year={2024},
url={https://openreview.net/forum?id=LWD7upg1ob}
}
If you have any questions related to the code or the paper, feel free to email Chulin ([email protected]) or open an issue.