Training models & generating data

Vladimir edited this page Feb 7, 2025 · 2 revisions

Now that we have the imputed dataset, training models and generating artificial data is straightforward. First place the imputed data file, imputed_data.csv, in the working data folder, then run the following code:

import pandas as pd
from synthwave.synthesizer.postimputation.correction import correct_imputed_data

DATA_PATH = "~/Work/data/"

adults = pd.read_csv(DATA_PATH + "imputed_data.csv", dtype_backend="pyarrow")

adults = correct_imputed_data(adults)

This should correct inconsistencies introduced by imputation.
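To see what the correction step changed, it can help to compare the frame before and after the call. The sketch below is only an illustration: `count_changed_cells` is a hypothetical helper, not part of synthwave, and it assumes the corrected frame keeps the same row order and columns as the input (not confirmed here). It uses small toy frames with default pandas dtypes in place of the real data.

```python
import pandas as pd


def count_changed_cells(before: pd.DataFrame, after: pd.DataFrame) -> pd.Series:
    """Count, per column, how many cells differ between two aligned frames.

    Hypothetical helper for illustration; not part of synthwave.
    """
    # A cell-wise comparison only makes sense for identically labelled frames.
    if not (before.index.equals(after.index) and before.columns.equals(after.columns)):
        raise ValueError("frames must have identical index and columns")
    # Missing values compare as unequal, so mask cells that are missing in both.
    both_missing = before.isna() & after.isna()
    differs = before.ne(after) & ~both_missing
    return differs.sum()


# Toy stand-ins for the data before and after correction.
before = pd.DataFrame({"age": [30, 5, 41], "employed": [1, 1, 0]})
after = pd.DataFrame({"age": [30, 5, 41], "employed": [1, 0, 0]})

changes = count_changed_cells(before, after)
print(changes)
```

With the real data, one would keep a copy of the frame (for example `adults.copy()`) before calling `correct_imputed_data` and pass both frames to the helper.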

The next step is to set up the models and pass the data to them:

from import Syntets

generator = Syntets(adults)


# load dataset
children = pd.read_parquet(DATA_PATH + "children_non_imputed_middle_fidelity.parquet").drop(columns=["id_person"])

# convert data types
uint8_columns = ["ordinal_person_age", "category_person_ethnic_group"]
children[uint8_columns] = children[uint8_columns].astype("uint8[pyarrow]")

# drop households with incomplete records
crooked_records = pd.unique(children[children["category_person_ethnic_group"].isna()]["id_household"])
children = children[~children["id_household"].isin(crooked_records)]
# NOTE: never drop duplicate rows; twins appear as identical records and would be lost

generator.train_children(children, verbose=True)

generator.drop_id_columns() # comes here because we need ids to learn how children are formed



This part produces the models used to generate new households and children when needed.
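The household filter and the warning about duplicates in the children preparation code can be illustrated on a toy frame. This is a minimal sketch with made-up values and default pandas dtypes; only the column names are taken from the code above. Household 2 has one child with a missing ethnic group, so the whole household is removed, while household 1 contains twins whose rows are identical once `id_person` is gone.

```python
import pandas as pd

# Toy data (made up): household 1 has twins, household 2 has an incomplete record.
children = pd.DataFrame(
    {
        "id_household": [1, 1, 2, 2, 3],
        "ordinal_person_age": [4, 4, 7, 9, 2],
        "category_person_ethnic_group": [1, 1, 2, None, 3],
    }
)

# Collect every household that has at least one missing ethnic group ...
missing = children["category_person_ethnic_group"].isna()
incomplete = children.loc[missing, "id_household"].unique()
# ... and drop all of its members, not just the incomplete row.
kept = children[~children["id_household"].isin(incomplete)]

# Both twin rows survive the household filter.
twins_kept = int((kept["id_household"] == 1).sum())
# drop_duplicates() would collapse the twins into a single child.
twins_after_dedup = int((kept.drop_duplicates()["id_household"] == 1).sum())

print(kept)
print(twins_kept, twins_after_dedup)
```

Filtering by household rather than by row keeps every remaining household complete, which presumably matters because the ids are used to learn how children are formed.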
