conorhenley edited this page Jun 30, 2014 · 5 revisions

Like the previous implementation, this implementation of PopGen requires its data inputs in CSV format. There are five required input tables: household marginals, household sample, person marginals, person sample, and geographic correspondence. Group quarters tables may eventually be allowed as optional inputs.

  • household marginals: this table contains household counts for each control variable category in each control geography (tract, block group, or TAZ). It should include a geographic identifier column and one column per control variable category (for example, if the user is controlling for income using three categories, this table should include 'income_1', 'income_2', and 'income_3' columns). The value in each control variable category column is the number of households in the geography that fall within that category.

  • household sample: this table contains a sample of individual household records. This sample is usually extracted from Census Public Use Microdata Sample (PUMS) data, a roughly three percent sample with households linked to the PUMA (Public Use Microdata Area). This table should contain a geographic identifier column (usually the PUMA ID), a PUMS record serial number column, a serial household ID column (1...n), and one column for each variable to be controlled for in the synthesis process. For example, if there are three income categories, the sample table should include an 'income' column, and each household record should have a value of 1, 2, or 3 according to the category it falls into.

  • person marginals: this table contains person counts for each person-level control variable category in each control geography. It should include a geographic identifier column and one column per control variable category (for example, if the user is controlling for age using five categories, this table should include 'age_1' through 'age_5' columns). The value in each control variable category column is the number of persons in the geography that fall within that category.

  • person sample: this table contains a sample of individual person records. This sample also comes from Census Public Use Microdata Sample (PUMS) data. This table should contain a geographic identifier column (usually the PUMA ID), a PUMS household record serial number column, a household ID column (both the serial number and the household ID should match the household record to which the person record is attached), and one column for each variable to be controlled for in the synthesis process. For example, if there are five age categories, the sample table should include an 'age' column, and each person record should have a value of 1, 2, 3, 4, or 5 according to the category it falls into.

  • geographic correspondence: because the control variables and the sample records are almost always reported at different geographies, a correspondence table is necessary. This table requires only two columns: a control geography ID column (tract, block group, or TAZ) and a sample geography column (puma_id). It determines which sample records are used to construct the synthetic population for each control geography.
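
As a rough illustration of how these tables relate to one another, the following sketch builds tiny in-memory stand-ins for four of the five inputs and runs a few consistency checks. It is not part of PopGen: the helper functions and the column names 'tract', 'serialno', and 'hid' are hypothetical, chosen only for this example, and all of the data values are made up. Only 'income', 'income_1' through 'income_3', 'age', and 'puma_id' follow the naming described above.

```python
import csv
import io

# Made-up CSV contents. 'tract', 'serialno' and 'hid' are hypothetical
# column names; PopGen may expect different ones.
HH_MARGINALS_CSV = """tract,income_1,income_2,income_3
101,40,35,25
102,10,50,40
"""

HH_SAMPLE_CSV = """puma_id,serialno,hid,income
7,2009000001,1,1
7,2009000002,2,3
8,2009000003,3,2
"""

PERSON_SAMPLE_CSV = """puma_id,serialno,hid,age
7,2009000001,1,2
7,2009000001,1,5
8,2009000003,3,1
"""

GEO_CORR_CSV = """tract,puma_id
101,7
102,8
"""


def read_table(text):
    """Parse CSV text into a list of dict rows (all values stay strings)."""
    return list(csv.DictReader(io.StringIO(text)))


def marginal_columns(variable, sample_rows):
    """Marginal column names implied by the categories seen in the sample."""
    categories = sorted({int(row[variable]) for row in sample_rows})
    return [f"{variable}_{c}" for c in categories]


def sample_for_geography(geo_id, corr_rows, sample_rows):
    """Sample records whose PUMA corresponds to the given control geography."""
    pumas = {r["puma_id"] for r in corr_rows if r["tract"] == geo_id}
    return [r for r in sample_rows if r["puma_id"] in pumas]


def orphan_persons(person_rows, hh_rows):
    """Person records with no matching (serial number, household ID) pair."""
    hh_keys = {(r["serialno"], r["hid"]) for r in hh_rows}
    return [p for p in person_rows if (p["serialno"], p["hid"]) not in hh_keys]


hh_marginals = read_table(HH_MARGINALS_CSV)
hh_sample = read_table(HH_SAMPLE_CSV)
person_sample = read_table(PERSON_SAMPLE_CSV)
geo_corr = read_table(GEO_CORR_CSV)

# Every category present in the sample should have a marginal column.
expected = marginal_columns("income", hh_sample)
missing = [c for c in expected if c not in hh_marginals[0]]

# Households available for synthesizing tract 101 (via its PUMA).
tract_101_households = sample_for_geography("101", geo_corr, hh_sample)

# Person records that cannot be attached to any household record.
orphans = orphan_persons(person_sample, hh_sample)
```

Checks of this kind can catch mismatched category codes or broken household links before the inputs are handed to the synthesizer. A person marginals table would be checked the same way as the household marginals, using the 'age' column of the person sample.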
