Synthetic data from PARSynthesizer does not follow original data distribution #2230

Open
PaudGS opened this issue Sep 18, 2024 · 5 comments
Labels
data:sequential Related to timeseries datasets

Comments


PaudGS commented Sep 18, 2024

Environment details

If you are already running SDV, please indicate the following details about the environment in
which you are running it:

  • SDV version: 1.16.1
  • Python version: 3.11.9
  • Operating System: Windows

Problem description

Generating synthetic numeric values with PARSynthesizer returns values very close to the mean of the original distribution, with little variance between them.
The data is a simple table consisting of patient_id (sequence_id), mesure_id, measure_date_time (sequence key), and the measurement value.

The histograms of both distributions look like this:
[image: histograms of the original and synthetic distributions]

What I already tried

I have tried different epoch values, a larger input dataset, and different RDT transformers.
Running the same data with the GaussianCopulaSynthesizer yields much better results, but I would lose the time series aspect of the original data.

Is this the expected behaviour of the PARSynthesizer or am I doing something wrong?

@PaudGS PaudGS added new Automatic label applied to new issues question General question about the software labels Sep 18, 2024
@srinify srinify self-assigned this Sep 19, 2024
@srinify srinify added under discussion Issue is currently being discussed and removed new Automatic label applied to new issues labels Sep 19, 2024
Contributor

srinify commented Sep 19, 2024

Hi @PaudGS 👋

At the moment, our single table and multi table synthesizers are definitely a bit more mature than PARSynthesizer, our sequential synthesizer. So this difference alone might be causing the shortcoming you're experiencing unfortunately, especially if you've already experimented with different epochs and different transformers.

To rule out a few more things, it would be helpful if you could share your metadata, the column(s) you care the most about, and maybe some sample values that represent the rough distribution (e.g. you can take your original values but scale them by a factor to add a layer of fuzziness). Oh, and also some more context into your use case in general!

This way, I can try to replicate the same distributions on my end, then suggest any possible improvements, and if needed we can document the shortcomings you encountered in a new issue for the team!
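The scaling suggestion above (multiplying the original values by a factor before sharing them) could be sketched roughly as follows. This is only an illustration using the Python standard library: the column names, the sample rows, and the factor range are assumptions made for the example, not taken from the actual dataset.

```python
import csv
import io
import random


def scale_column(rows, column, factor):
    """Return copies of the rows with one numeric column multiplied by factor."""
    scaled = []
    for row in rows:
        new_row = dict(row)  # copy so the original rows stay untouched
        new_row[column] = round(float(row[column]) * factor, 2)
        scaled.append(new_row)
    return scaled


# Hypothetical sample data; the column names are assumptions for this sketch.
sample_csv = """patient_id,measure_date_time,value
1,2024-01-01 08:00,120
1,2024-01-01 12:00,135
2,2024-01-01 09:00,110
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))

# Pick a factor and keep it private; the 0.8 to 1.2 range is an arbitrary choice.
factor = random.Random(0).uniform(0.8, 1.2)
shared_rows = scale_column(rows, "value", factor)

for row in shared_rows:
    print(row)
```

Multiplying by a single constant keeps the relative shape of the distribution (the ratios between values are unchanged), which is what matters for reproducing the problem, while hiding the true magnitudes.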

@srinify srinify added the data:sequential Related to timeseries datasets label Sep 19, 2024
Author

PaudGS commented Sep 23, 2024

Hi @srinify,
Thanks for the quick response.
About the use case: I am working with measurement data (blood pressure, heart rate, respiratory rate...) from hospitalized patients. We are looking for ways to automate the creation of synthetic patients and model how their health evolves throughout the hospital stay.
For privacy reasons, I have attached manually curated synthetic blood pressure measurements instead of the original data, but the results are similar with both datasets.
metadata.json
data.csv

Thanks in advance!

Contributor

srinify commented Oct 1, 2024

Thanks for sharing this, @PaudGS! I'll leave this thread open for now so we can collect more examples from the community and hopefully invest in improving the experience over time.

@srinify srinify removed the under discussion Issue is currently being discussed label Oct 1, 2024
@srinify srinify removed the question General question about the software label Oct 10, 2024

ardulat commented Oct 23, 2024

Hi, @srinify! I am experiencing the same issue for a numerical column in the time series sampled with PARSynthesizer. Here is the distribution plot:
[image: screenshot of the distribution plot, 2024-10-23]

I would really appreciate any help you can provide!

@npatki npatki added the under discussion Issue is currently being discussed label Nov 1, 2024
@srinify srinify removed the under discussion Issue is currently being discussed label Nov 1, 2024
@npatki npatki added the under discussion Issue is currently being discussed label Nov 4, 2024
Contributor

srinify commented Nov 13, 2024

Hi @PaudGS, thanks again for sharing some of your data and metadata! I was able to reproduce the issue on my end and can also confirm that changing the transformer didn't help. PAR seems to produce a very narrow distribution around the mean value. This will be helpful for the team when we're able to work on improving the sequential synthesizer experience.
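For anyone who wants to put a number on the "narrow distribution around the mean" symptom before adding an example to this thread, a minimal sketch along these lines may help. It uses only the Python standard library; the two value lists are made up for illustration and the helper names are hypothetical, not part of SDV.

```python
import statistics


def spread_summary(values):
    """Summarize the location and spread of a list of numeric values."""
    return {
        "mean": statistics.fmean(values),
        "stdev": statistics.stdev(values),
        "min": min(values),
        "max": max(values),
    }


def stdev_ratio(real, synthetic):
    """Ratio of synthetic to real standard deviation (1.0 means equal spread)."""
    real_stdev = statistics.stdev(real)
    if real_stdev == 0:
        raise ValueError("real values have zero variance")
    return statistics.stdev(synthetic) / real_stdev


# Made-up values for illustration only (not from the attached data.csv).
real_values = [95, 110, 120, 135, 150, 160, 105, 125]
synthetic_values = [124, 125, 126, 125, 124, 126, 125, 125]

print("real:     ", spread_summary(real_values))
print("synthetic:", spread_summary(synthetic_values))

ratio = stdev_ratio(real_values, synthetic_values)
print(f"stdev ratio (synthetic / real): {ratio:.3f}")
```

A ratio far below 1.0 with nearly identical means matches the behaviour described in this issue: the synthetic column is centered correctly but has collapsed variance.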

@npatki npatki removed the under discussion Issue is currently being discussed label Dec 10, 2024