
Training models & generating data

Vladimir edited this page Feb 7, 2025 · 2 revisions

Now that we have the imputed dataset, training models and generating artificial data is straightforward. Place the imputed data file, imputed_data.csv, in the working data folder, then run the following code:

import pandas as pd
from synthwave.synthesizer.postimputation.correction import correct_imputed_data

DATA_PATH = "~/Work/data/"

adults = pd.read_csv(DATA_PATH + "imputed_data.csv", dtype_backend="pyarrow")

adults = correct_imputed_data(adults)

This should correct inconsistencies introduced by imputation.
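Before moving on, it can be useful to confirm that the table really has no gaps left. The snippet below is a minimal sketch, assuming the imputed data is expected to contain no missing values; `count_missing` is a hypothetical helper and not part of synthwave, and the toy frame only stands in for the real `adults` table.

```python
import pandas as pd


def count_missing(frame: pd.DataFrame) -> dict:
    """Return {column: number of missing values} for columns that still have gaps."""
    counts = frame.isna().sum()
    return {column: int(n) for column, n in counts.items() if n > 0}


# Toy stand-in for the imputed table; the values are invented for illustration.
toy = pd.DataFrame({
    "id_household": [1, 1, 2],
    "ordinal_person_age": [34, None, 51],
})

missing = count_missing(toy)
print(missing)  # an empty dict would mean no gaps remain
```

On the real data, the same helper could be called as `count_missing(adults)` right after the correction step.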

The next step is to set up the models and pass the data to them:

from synthwave.synthesizer.uk.generator import Syntets

generator = Syntets(adults)

generator.split_data()
generator.restructure_data()

# load dataset
children = pd.read_parquet(DATA_PATH + "children_non_imputed_middle_fidelity.parquet").drop(columns=["id_person"])

# convert data types
children[["ordinal_person_age", "category_person_ethnic_group"]] = children[["ordinal_person_age", "category_person_ethnic_group"]].astype("uint8[pyarrow]")

# drop households with incomplete records
crooked_records = pd.unique(children[children["category_person_ethnic_group"].isna()]["id_household"])
children = children[~children["id_household"].isin(crooked_records)]
# NOTE: never drop duplicates here; identical rows may be twins and would be lost

generator.train_children(children, verbose=True)

generator.drop_id_columns()  # done only now, because the ids are needed to learn how children are formed


generator.locate_degenerate_distributions()

generator.convert_types()
generator.init_models(_epochs=5)
generator.attach_constraints()
generator.train()

This part produces the trained models, which can generate new households and children when needed.
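The household filter and the warning about duplicates in the code above can be illustrated on a toy table. This is a minimal sketch with invented values; only the column names are taken from the code above, and default pandas dtypes are used instead of pyarrow ones to keep it self-contained.

```python
import pandas as pd

# Toy children table; values are invented for illustration only.
toy_children = pd.DataFrame({
    "id_household": [10, 10, 20, 20, 30],
    "ordinal_person_age": [4, 4, 7, 2, 9],
    "category_person_ethnic_group": [1, 1, 2, None, 3],
})

# Households with at least one child missing an ethnic group.
incomplete = toy_children.loc[
    toy_children["category_person_ethnic_group"].isna(), "id_household"
].unique()

# Drop every child of those households, not only the incomplete rows.
kept = toy_children[~toy_children["id_household"].isin(incomplete)]

# Household 10 holds twins: two identical rows that should both survive.
deduplicated = kept.drop_duplicates()

print(len(kept), len(deduplicated))  # prints "3 2": deduplication loses one twin
```

Household 20 is removed as a whole because one of its children has no ethnic group, while the two identical rows of household 10 are kept. Calling `drop_duplicates()` would collapse them into a single child, which is why the original code avoids it.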
