I want the ID column length should match the given regex pattern #2229

Open
Veeresh1996 opened this issue Sep 17, 2024 · 4 comments
Labels
question General question about the software under discussion Issue is currently being discussed

Comments


Veeresh1996 commented Sep 17, 2024

Environment details

If you are already running SDV, please indicate the following details about the environment in
which you are running it:

  • SDV version: 1.14.0
  • Python version: 3.12.3
  • Operating System: Windows 11

Problem description

I am using HMASynthesizer for multi-table data. I can generate data with the trained model, but for the columns I have marked as IDs, the length of the generated values does not match the real data, even though I have specified a regex pattern. For example, one of the ID columns contains 6-digit values, but the generated output contains values of varying lengths. Real data ID value: 300164. Generated value: 2690.

What I already tried

This is the metadata for that specific field:
"patnum": {
    "sdtype": "id",
    "regex_format": "^\d{6}$"
}
Could you please look into it ASAP? Please let me know if you need any other info.

Thanks in advance

@Veeresh1996 Veeresh1996 added new Automatic label applied to new issues question General question about the software labels Sep 17, 2024
Contributor

npatki commented Sep 17, 2024

Hi @Veeresh1996, SDV is designed to ensure that the synthetic data matches (a) the regex format that you provide and (b) the original data type of the real data. In your case, the two appear to be in conflict: the regex describes 6-digit strings, but the original data type appears to be an integer.

The regex may correctly produce strings such as "002690", but when converted to an integer this becomes 2690 (no longer 6 characters). So the regex is not really compatible with the data type. To fix this issue, you would have to address the root cause of the mismatch.

  • You could update the regex format so that leading 0s are not possible, e.g. enforce that the first digit cannot be 0: [1-9]\d{5}, or
  • Convert the real data to strings before passing it into SDV, so that the synthetic data output will also be strings.
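
To make the mismatch concrete, here is a minimal sketch using only Python's `re` module and pandas. It does not call SDV; the sample values and the variable names are assumptions for illustration (only 300164 and 2690 appear in this thread), and the column name `patnum` is taken from the metadata above.

```python
import re

import pandas as pd

# Illustrative values; only 300164 and 2690 appear in the thread.
generated = ["002690", "300164"]

# Both strings satisfy the original 6-digit pattern.
six_digits = re.compile(r"^\d{6}$")
matches_original = [bool(six_digits.fullmatch(v)) for v in generated]

# Converting to integers silently drops the leading zeros.
as_int = [int(v) for v in generated]

# Option 1: require a first digit of 1-9, so every match keeps
# its 6 characters after an integer conversion.
no_leading_zero = re.compile(r"^[1-9]\d{5}$")
survives_int_cast = [bool(no_leading_zero.fullmatch(v)) for v in generated]

# Option 2: turn the real column into strings before fitting, so the
# data type no longer conflicts with the regex.
real = pd.DataFrame({"patnum": [300164, 300165]})
real["patnum"] = real["patnum"].astype(str)
patnum_as_str = real["patnum"].tolist()

print(matches_original, as_int, survives_int_cast, patnum_as_str)
```

With option 1 the column can stay an integer; with option 2 a value such as "002690" remains a valid 6-character ID because it is never converted to a number.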

@npatki npatki added under discussion Issue is currently being discussed and removed new Automatic label applied to new issues labels Sep 17, 2024
Author

Veeresh1996 commented Sep 17, 2024

Hey Neha, thanks, the solution you provided works for me.

  1. Is it possible to generate duplicate values in ID columns? For example, I want a six-digit value, and it is OK for the value to be duplicated within the same field.
  2. I have null values in one of my ID columns (it is neither a primary key nor a foreign key, just a unique value). I want to generate the same kind of data, with unique and null values in that field. How can I achieve that?

Contributor

srinify commented Sep 25, 2024

Hi @Veeresh1996 I work with Neha, hope you don't mind me stepping in here.

  1. The purpose of an ID column is to uniquely identify rows. In the multi-table context, SDV:
  • will generate a unique ID value for every row that is both set to the id sdtype and also set as the primary key
  • could generate multiple rows with the same ID value in the child table, because it's the foreign key and the cardinality is trying to be mirrored from the real data

At the moment, we don't support duplicate values for ID columns. The only duplicates that will occur are in foreign key columns, where duplicate values are a side effect of a one-to-many relationship.

  2. SDV will learn the null values in your regular ID columns and try to recreate that ratio in the synthetic data as well.

Here's a quick code snippet you can run with a public dataset to see this for yourself!

from sdv.datasets.demo import download_demo
from sdv.multi_table import HMASynthesizer
import numpy as np

data, metadata = download_demo(
    modality='multi_table',
    dataset_name='fake_hotels'
)

# Non-FK and non-PK column set to ID
metadata.update_column(table_name='hotels', column_name='classification', sdtype='id')

# Adding a single NaN to this ID column
data['hotels'].loc[4, 'classification'] = np.nan

synthesizer = HMASynthesizer(metadata)
synthesizer.fit(data)
synthetic_data = synthesizer.sample()
synthetic_data['hotels']['classification'].isnull().value_counts()

The final line of code there will return False: 9, True: 1, which corresponds to 1 NaN value in the synthetic data for that column as well.
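
To compare the null ratio of a real column against its synthetic counterpart directly, a small pandas helper along these lines could be used. This is only a sketch: `null_ratio` is a hypothetical helper name, not part of SDV, and the two series are made-up stand-ins for a 10-row real column and synthetic column.

```python
import numpy as np
import pandas as pd


def null_ratio(column: pd.Series) -> float:
    """Fraction of missing values in a column."""
    return float(column.isna().mean())


# Made-up stand-ins: 10 rows each, with exactly one missing value.
real_column = pd.Series(["a", "b", "c", "d", np.nan, "f", "g", "h", "i", "j"])
synthetic_column = pd.Series(["q", np.nan, "r", "s", "t", "u", "v", "w", "x", "y"])

real_ratio = null_ratio(real_column)
synthetic_ratio = null_ratio(synthetic_column)
print(real_ratio, synthetic_ratio)
```

On real output, the same function could be applied to `data['hotels']['classification']` and `synthetic_data['hotels']['classification']` from the snippet above.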

Contributor

npatki commented Sep 26, 2024

Hi @srinify and @Veeresh1996, just to clarify something on point 1: When you supply a column as an id, it just means that the column represents a label that can help you identify a concept. It does not always have to be unique. (This blog post has some useful information.)

As an example, consider three different id columns:

A. A primary key column might be sdtype id
B. A foreign key column might be sdtype id
C. You may have a generic column that is sdtype id (neither a primary nor foreign key)

In this case, SDV will only enforce that A is unique, as primary keys must uniquely distinguish every row. SDV will allow B and C to repeat. Just be aware that for C, depending on the regex you provide, you may need to sample a lot of data in order to see the duplicates.
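
To check whether a sampled column of type C actually contains repeats, one option is to count them with pandas. The sketch below is not part of SDV: `count_repeats` is a hypothetical helper name and the example values are made up.

```python
import pandas as pd


def count_repeats(column: pd.Series) -> int:
    """Number of non-null values that repeat an earlier value."""
    return int(column.dropna().duplicated().sum())


# Made-up example: "300164" appears twice, and the missing value is ignored.
example = pd.Series(["300164", "300165", "300164", None])
repeats = count_repeats(example)
print(repeats)
```

A result of 0 on a large sample would suggest that the regex yields too many distinct values for repeats to show up at that sample size.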

@Veeresh1996 Would you like your ID values (with regex) to repeat at higher frequency? This is definitely a feature request that we can track. We prioritize based on how important it is to your use case, so we would appreciate any more info you are able to provide.
