I want the ID column length should match the given regex pattern #2229
Comments
Hi @Veeresh1996, SDV is designed to ensure that the synthetic data matches (a) the regex format that you provide and (b) the original data type of the real data. In your case, the two appear to conflict: the regex describes 6-digit strings, but the original data type appears to be an integer. The regex may correctly produce strings such as
Hey Neha, thanks! The solution that you provided works for me.
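For readers following along, the conflict described above (a regex that yields 6-digit strings versus a real column stored as an integer) can be sketched in plain Python. This is only an illustration under that assumption, not SDV's internal logic: the sample strings are made up, and the zero-padding step is one possible way to restore the width, not necessarily the fix discussed in this thread.

```python
import re

# The regex_format from the issue's metadata.
ID_PATTERN = re.compile(r"^\d{6}$")

# Hypothetical strings that a 6-digit regex could generate (illustrative only).
generated_strings = ["002690", "300164", "000042"]

# Every string satisfies the regex.
all_match = all(ID_PATTERN.fullmatch(s) is not None for s in generated_strings)

# Casting to integer, as would happen if the real column is numeric,
# silently drops any leading zeros.
as_integers = [int(s) for s in generated_strings]

# Converting back to text and left-padding restores the 6-character width.
restored = [str(n).zfill(6) for n in as_integers]

print(all_match)
print(as_integers)
print(restored)
```

If the IDs are meant to be fixed-width codes, storing the real column as strings before fitting avoids the integer cast altogether.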
Hi @Veeresh1996 I work with Neha, hope you don't mind me stepping in here.
At the moment, we don't support duplicate values for ID columns -- the only duplicates occur when the ID column is a foreign key column, where duplicate values are a side effect of a one-to-many relationship.
Here's a quick code snippet you can run with a public dataset to see this for yourself!
from sdv.datasets.demo import download_demo
from sdv.multi_table import HMASynthesizer
import numpy as np

data, metadata = download_demo(
    modality='multi_table',
    dataset_name='fake_hotels'
)

# Non-FK and non-PK column set to ID
metadata.update_column(table_name='hotels', column_name='classification', sdtype='id')

# Adding a single NaN to this ID column
data['hotels'].loc[4, 'classification'] = np.nan

synthesizer = HMASynthesizer(metadata)
synthesizer.fit(data)
synthetic_data = synthesizer.sample()
synthetic_data['hotels']['classification'].isnull().value_counts()
The final line of code there will return
Hi @srinify and @Veeresh1996, just to clarify something on point 1: When you supply a column as an
As an example, consider three different id columns: A. A primary key column might be sdtype id
In this case, SDV will only enforce that A is unique -- as primary keys must uniquely distinguish every row. SDV will allow B and C to repeat. Just be aware that for C, depending on the regex you provide, you may need to sample a lot of data in order to see the duplicates.
@Veeresh1996 Would you like your ID values (with regex) to repeat at a higher frequency? This is definitely a feature request that we can track. We prioritize based on how important it is to your use case, so we would appreciate any more info you are able to provide.
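To see whether a sampled ID column actually contains repeats, a small check along these lines may help. It is a minimal sketch: `repeated_values` is a hypothetical helper (not part of SDV), and the sample IDs are invented for illustration.

```python
from collections import Counter


def repeated_values(values):
    """Return {value: count} for every value that appears more than once."""
    counts = Counter(values)
    return {value: n for value, n in counts.items() if n > 1}


# Hypothetical sampled ID values (not real SDV output).
sampled_ids = ["300164", "118842", "300164", "907311"]

print(repeated_values(sampled_ids))
```

With real output, the same helper could be given a column converted to a list, for example `synthetic_data['hotels']['classification'].tolist()`. An empty result means no duplicates were sampled.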
Environment details
If you are already running SDV, please indicate the following details about the environment in
which you are running it:
Problem description
I am using HMASynthesizer for multi-table data. I am able to generate data with the trained model. But for the columns that I have marked as IDs, the length of the generated values does not match the real data, even though I have specified the regex pattern. For example, one of the ID columns contains 6-digit values, but the generated output contains values of random lengths. Real data ID value: 300164. Generated value: 2690.
What I already tried
This is the metadata for that specific field,
"patnum": {
"sdtype": "id",
"regex_format": "^\d{6}$"
}
Could you please look into it ASAP? Please let me know if you need any other info
Thanks in advance