PAR Diagnostic is not 1.0 for datetime context columns #2018

Open
npatki opened this issue May 20, 2024 · 1 comment
Labels
bug Something isn't working data:sequential Related to timeseries datasets

Comments

@npatki
Contributor

npatki commented May 20, 2024

Environment Details

Please indicate the following details about the environment in which you found the bug:

  • SDV version: 1.12.0
  • Python version: 3.12
  • Operating System: Linux

Error Description

As originally described by @Ng-ms in #2004: when a datetime column is used as a context column, the synthesized values for that column fall outside the min/max range observed in the real data. As a result, the BoundaryAdherence score for that context column is <1.0.
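To make the symptom concrete, here is a minimal sketch of the kind of check that BoundaryAdherence performs: the fraction of synthetic values that lie within the min/max of the real values. This is an illustration written for this thread, not the SDMetrics implementation; the function name `boundary_adherence` and the sample dates are made up.

```python
from datetime import datetime


def boundary_adherence(real_values, synthetic_values):
    """Fraction of synthetic values inside [min(real), max(real)].

    Illustrative only; missing values (None) are ignored on both sides.
    """
    real = [v for v in real_values if v is not None]
    synthetic = [v for v in synthetic_values if v is not None]
    if not real or not synthetic:
        raise ValueError("both inputs need at least one non-missing value")
    low, high = min(real), max(real)
    inside = sum(1 for v in synthetic if low <= v <= high)
    return inside / len(synthetic)


# Made-up example: two of the four synthetic dates fall outside the real range.
real_dates = [datetime(2020, 1, 1), datetime(2020, 6, 15), datetime(2020, 12, 31)]
synthetic_dates = [
    datetime(2020, 3, 1),    # inside
    datetime(2019, 12, 30),  # before the real minimum
    datetime(2020, 7, 4),    # inside
    datetime(2021, 1, 2),    # after the real maximum
]
score = boundary_adherence(real_dates, synthetic_dates)
print(score)  # 0.5
```

With `enforce_min_max_values=True`, the expectation in this issue is that such a check would return 1.0 for the context column.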

Steps to reproduce

Note that the dataset is not available for privacy reasons. The SDV team will try to replicate this with SDV demo data.

import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sdv.sequential import PARSynthesizer

# df, metadata, numeric_columns and date_columns come from the private dataset (not shown)
min_max_scaler = MinMaxScaler()
df[numeric_columns] = min_max_scaler.fit_transform(df[numeric_columns])
df[date_columns] = df[date_columns].apply(pd.to_datetime, format='%d/%m/%Y', errors='coerce')
# Cast pre_date to an integer column (nanoseconds since the epoch for datetime input)
df['pre_date'] = pd.to_datetime(df['pre_date'], unit='ns').astype(int)
metadata.set_sequence_index(column_name='visit_date')
synthesizer = PARSynthesizer(metadata, epochs=1000, context_columns=['pre_date', 'sex', 'Cod'], verbose=True, enforce_min_max_values=True, enforce_rounding=True, cuda=True)
synthesizer.fit(df)
synthetic_data = synthesizer.sample(num_sequences=4000, sequence_length=None)

Diagnostic score output:
[Image: screenshot of the diagnostic score output]

For this issue let's just focus on the fact that context column pre_date has a score <1.0. There is a separate issue for the sequence index visit_date.

@npatki added the bug, new, and data:sequential labels and removed the new label on May 20, 2024
@srinify
Contributor

srinify commented May 31, 2024

I'm not able to reproduce this issue using our demo datasets (or even using randomly generated data).

I'll leave this issue open if someone is able to come along and share code to help us reproduce this issue! @Ng-ms
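For anyone who wants to attempt a shareable reproduction, the following is a rough sketch of how a small random sequential dataset with a datetime context column could be generated without the private data. It only builds the rows; it does not run SDV. The column names `pre_date`, `sex`, and `visit_date` and the `%d/%m/%Y` format come from the snippet above, while `patient_id`, `measurement`, and the function `make_sequences` are hypothetical names chosen for illustration.

```python
import random
from datetime import datetime, timedelta


def make_sequences(num_sequences=5, max_length=4, seed=0):
    """Build rows for random sequences with per-sequence context values."""
    rng = random.Random(seed)
    start = datetime(2020, 1, 1)
    rows = []
    for seq_id in range(num_sequences):
        # Context values are drawn once per sequence and repeated on every row.
        pre_date = start + timedelta(days=rng.randint(0, 364))
        sex = rng.choice(["F", "M"])
        visit = pre_date
        for _ in range(rng.randint(2, max_length)):
            # The sequence index (visit_date) strictly increases within a sequence.
            visit += timedelta(days=rng.randint(1, 30))
            rows.append({
                "patient_id": seq_id,  # hypothetical sequence key
                "pre_date": pre_date.strftime("%d/%m/%Y"),
                "sex": sex,
                "visit_date": visit.strftime("%d/%m/%Y"),
                "measurement": round(rng.random(), 3),  # hypothetical numeric column
            })
    return rows


rows = make_sequences()
```

The resulting list of dictionaries can be passed to `pandas.DataFrame(rows)` and then run through the same preprocessing and `PARSynthesizer` steps shown in the issue description.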

@srinify srinify self-assigned this Jun 10, 2024
@srinify srinify removed their assignment Oct 2, 2024