# Population synthesis without microdata {#nomicrodata}
Sometimes no representative individual level
dataset is available as an input for population synthesis.
In this case, the methods described in the previous
chapters must be adapted accordingly.
The challenge
is still to generate spatial microdata
that fits all the constraint tables, but based on a purely synthetic
'seed' input cross-tabulated contingency table. Many
combinations of individual level data
could correspond to these distributions. Depending on the aim
of the spatial microsimulation model, simply finding one reasonable fit
can be sufficient.
\index{sample-free methods}
In other cases a fit based on *entropy maximisation* may be required.
This concept involves finding the population that is most likely to
represent the micro level populations (Bierlaire, 1991) [@Bierlaire]. \index{entropy maximisation}
This chapter demonstrates
two options for population synthesis when
real individual level data is unavailable:
- *Global cross-tables and local marginal distributions* (\@ref(CrossGlobLocalMarg))
explains a method for cases where the constraints consist of a cross-table
that is not spatially located, together with local marginal distributions.
- *Two level aggregated data* (\@ref(twoLevelData)) contains a procedure for
performing spatial microsimulation with data at different levels of aggregation,
for example one dataset for provinces and another for districts.
## Global cross-tables and local marginal distributions {#CrossGlobLocalMarg}
Assume we have a contingency table of constraint variables
for the entire study area, but not at the local level. This multi-dimensional
cross-table (the seed) could be the result of a
previous step, such as using IPF to
re-weight individual level data to fit the case-study area of interest.
If the marginal distributions for small areas are known,
we can use the **mipfp** package as previously shown. If, however,
the only information about the zones is the total population living
there, the function is usable only when the zone is treated as
a variable. In this specific case, with
no additional data, the only option is to
re-scale the global cross-table for each zone. Note that this
assumes that the correlations between the variables
are independent of the zone in question.
To illustrate, we will develop the SimpleWorld example
(which can be loaded from the book's data directory by entering
`source("code/SimpleWorld.R")`, or may already be loaded if you followed \@ref(data-prep)) with
adapted constraints. The aggregate level data available so far is:
```{r,echo=FALSE, message=FALSE}
source("code/SimpleWorld.R")
```
```{r}
# Cross-tabulation of individual level dataset
table(ind$age, ind$sex)
(total_pop <- rowSums(con_sex)) # total population of each zone
```
In this section, the local constraint will be the total number
of people in each zone (the last column of `consTot`). The global constraint
is a matrix in the form of the cross-table between age and sex, but
including the total population (33 people for SimpleWorld). The new
constraints could be:
```{r}
# Global Constraint possible for SimpleWorld
global_cons <- table(ind$age, ind$sex)
global_cons[1,] <- c(6,9)
global_cons[2,] <- c(7,11)
# Local Constraint for SimpleWorld
local_cons <- total_pop
```
When only the total population is known for each zone, the best way to
create a synthetic population is to simply re-scale the cross-table.
For each zone, a table proportional to the global one is created.
The results are stored in a three-dimensional array, whose first
dimension represents the zone. The first step is to initialise the
resulting array, filling it with zeros.
```{r}
# initialise result's array and its names
resNames <- list(1:nrow(cons), rownames(global_cons),
colnames(global_cons))
res <- array(0, dim=c(nrow(cons), dim(global_cons)),
dimnames=resNames)
```
Now the final weight table is calculated, simply by
taking the global matrix and re-scaling it to fit
the desired marginals. In this way we keep the global
proportions, but with the correct total per zone.
Note that this process is exactly equivalent to
running **mipfp** on the seed table with only the zone
marginals as constraints.
```{r}
# Re-scale the cross-table to fit the zone's constraints
for (zone in 1:length(total_pop)){ # loop over the zones
res[zone,,] <- global_cons * total_pop[zone] / sum(global_cons)
}
# Print the cross-table for zone 1
res[1,,]
```
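The equivalence with **mipfp** noted above can be checked directly. The sketch below treats the zone as an extra dimension of the seed and constrains only the zone totals; the variable layout is our own and the `mipfp` package is assumed to be installed.

```r
library(mipfp) # provides Ipfp(); assumed installed

# SimpleWorld inputs, as defined above
global_cons <- matrix(c(6, 7, 9, 11), nrow = 2)
total_pop <- c(12, 10, 11) # zone totals

# Seed: one copy of the global cross-table per zone (zone = dimension 1)
seed <- array(NA, dim = c(length(total_pop), dim(global_cons)))
for (zone in seq_along(total_pop)) seed[zone, , ] <- global_cons

# Single constraint: the zone totals, i.e. the margin over dimension 1
fit <- Ipfp(seed, target.list = list(1), target.data = list(total_pop))

# Each zone slice equals the global table re-scaled to the zone total
all.equal(fit$x.hat[1, , ], global_cons * total_pop[1] / sum(global_cons))
```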
We can verify that the total population
per zone is of the desired size. We can also
check the global table of age and sex.
This confirms that the weights now fit
all the available data.
```{r}
# Check the local constraints for each zone (should be TRUE)
for (zone in 1:length(total_pop)){
print( sum(round(res[zone,,])) == total_pop[zone] )
}
# Save the global final table
SimTot <- apply(res,c(2,3),sum)
# Check the global constraint (should be 0)
sum(SimTot - global_cons)
```
As with IPF, the fractional result needs to be integerised to create
spatial microdata. The `round()` function generally provides a reasonable approximation
in terms of fitting the constraints. The integerisation algorithms introduced earlier, such as
*truncate, replicate, sample* (TRS), can also be used, although they cannot be followed
exactly here because we want to fit the few constraints we have perfectly.
In our example, a satisfactory result is achieved using
the `round()` function, as shown in the code below.
```{r}
# Integerisation by simply using round
resRound <- round(res)
# Zero error achieved by rounding for global constraint
apply(resRound, c(2,3), sum) - global_cons
# Zero error achieved by rounding for local constraint
apply(resRound,c(1),sum) - local_cons
```
It is due to luck (and the small size of the SimpleWorld example)
that the `round` method works in this case: in most
cases there will be errors due to rounding.
If a zone had 4 individuals and three categories, for example,
the resulting weights could be
$(\frac{4}{3},\frac{4}{3},\frac{4}{3})$. Rounding then
gives $(1,1,1)$, leaving too few individuals in the
synthetic population (3 rather than 4). We can try the algorithms proposed
in \@ref(sintegerisation). However, as illustrated
by the following code chunks, these integerisation methods lead to
errors in relation to the constraints.
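The rounding shortfall in the $(\frac{4}{3},\frac{4}{3},\frac{4}{3})$ example takes two lines to reproduce:

```r
# Fractional weights for a zone of 4 people split over 3 categories
w <- rep(4 / 3, 3)
sum(w)        # 4 people before integerisation
sum(round(w)) # only 3 after rounding: one person is lost
```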
```{r}
# Integerisation with pp
res_pp <- int_pp(res)
apply(res_pp, c(2,3), sum) - global_cons
```
These errors are often very small: when
modelling a whole country, the relative error is
negligible. Note that this small error comes from the
random draw in the last stage of the algorithm.
Here, TRS performs better than PP, as explained
in Chapter 5.
```{r}
# Integerisation with trs
set.seed(17)
res_trs <- res_pp <- array(dim = dim(res))
# Apply trs (see code/functions.R to see how int_trs works)
res_trs[] <- int_trs(res)
# Print the errors
apply(res_trs, c(2,3), sum) - global_cons
```
If desired, we can adapt TRS to ensure it
fits the constraints at the end of the process.
To adapt the method to use TRS, we first
truncate^[Truncating means
rounding each weight down to the nearest integer below it. This implies
that the population is underestimated.] the data and identify the missing individuals, in
terms of the constraints.
```{r}
# Truncate
resTruncate <- floor(res)
# number of missing individuals
sum(total_pop) - sum(resTruncate)
```
This means that, in total, 4 individuals are missing
after truncation. We have to choose which categories
will be incremented. For this, basic TRS takes the decimal parts
of the weights (discarded by the truncation) and
makes a random draw from this distribution. It is at this step that
errors with respect to the constraints can be introduced. To obtain a better fit after
integerisation, we need to
observe in which category and in which zone
individuals have to be added.
```{r}
# Calculate the total simulated cross-table
# After truncate
SimTotTruncate <- apply(resTruncate,c(2,3),sum)
# Number of missing individuals per category
# After truncate
ToAdd <- global_cons - SimTotTruncate
ToAdd
# Number of missing individuals per zone
# After truncate
ToComplete <- local_cons - apply(resTruncate,c(1),sum)
ToComplete
```
We observe that one individual must be added per
category of age and sex. In terms of zones, one individual is
missing in zones one and three, whereas two persons have to
be added in zone 2.
The principle now is to add people to the incomplete zones and
categories. The cells to be incremented are always chosen as the ones
with the biggest decimal parts (whereas in standard TRS, these decimals act as probabilities). Note that we chose
to adapt `resTruncate` instead of defining
another table.
The code works as follows: as long as individuals are missing
from the matrix, we look at the next biggest decimal and add
an individual to the corresponding cell if someone is missing in that
category and that zone.
\pagebreak
```{r}
# Calculate the decimals left by the truncation
decimals <- res - resTruncate
# Adapt resTruncate to fit all constraints
while (sum(total_pop) - sum(resTruncate) > 0){
  # find the cell with the biggest decimal ([1, ] guards against ties)
  i <- which(decimals == max(decimals), arr.ind = TRUE)[1, , drop = FALSE]
  # mark this cell as already considered
  decimals[i] <- 0
  # if individuals are still missing in this zone
  if (ToComplete[i[1]] > 0){
    # and in this category
    if (ToAdd[i[2], i[3]] > 0){
      resTruncate[i] <- resTruncate[i] + 1
      ToComplete[i[1]] <- ToComplete[i[1]] - 1
      ToAdd[i[2], i[3]] <- ToAdd[i[2], i[3]] - 1
    }
  }
}
```
The new values in `resTruncate` follow all constraints.
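The largest-remainder logic of the loop above can be illustrated on a single vector of weights (a standalone sketch, separate from the book's arrays):

```r
# Largest-remainder integerisation of one vector of fractional weights
w <- c(1.6, 1.3, 1.1)            # sums to 4
intw <- floor(w)                 # (1, 1, 1): one individual missing
n_missing <- round(sum(w)) - sum(intw)
# add the missing individuals to the cells with the largest decimals
decimals <- w - intw
ord <- order(decimals, decreasing = TRUE)
intw[ord[seq_len(n_missing)]] <- intw[ord[seq_len(n_missing)]] + 1
intw                             # (2, 1, 1): the total now matches
```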
The adaptation of TRS could be avoided by using combinatorial
optimisation to integerise. Indeed, such a process could choose,
for each cell, whether to take the integer just below or just above the weight,
by optimising the fit to the constraints. We use TRS here because it is
faster and requires fewer lines of code [see @Lovelace2013-trs for more detail].
After the integerisation, the last step to
obtain the final individual dataset is
expansion. This stage is intuitive, since
we now have a table containing the number of
individuals in each category. Thus, we simply
need to replicate each combination of categories
the right number of times.
We first flatten the three-dimensional array;
then the final individual level dataset `ind_data`
is created.
```{r}
countData <- as.data.frame.table(resTruncate)
indices <- rep(1:nrow(countData), countData$Freq)
ind_data <- countData[indices,]
```
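As a standalone sanity check of the expansion step, the same `as.data.frame.table()` and `rep()` pattern applied to a toy count table gives one row per individual:

```r
# Toy count table: 2 age bands x 2 sexes, 6 individuals in total
counts <- matrix(c(2, 1, 0, 3), nrow = 2,
                 dimnames = list(age = c("young", "old"), sex = c("m", "f")))
countData <- as.data.frame.table(counts)
# Replicate each category combination Freq times
ind_data <- countData[rep(1:nrow(countData), countData$Freq), ]
nrow(ind_data) # 6: one row per individual
```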
## Two level aggregated data {#twoLevelData}
We present here how to find a possible distribution per zone when
only aggregated data are available, but at two different levels of aggregation.
For example, we may have some data for municipalities and other data for
districts. A first proposal would be to use a genetic algorithm that
minimises the distance between the constraints and the simulation. This
can give very good solutions, but requires a high level of understanding of
optimisation and is, for the moment, rare in the literature. For classical
problems a simpler method is available, the basis of which is
explained here. The solution proposed by Barthélemy and Toint (2013), and used in this book,
is to generate a 'seed' before executing IPF.
Their paper demonstrates the simulation of a population at the municipality
level with four characteristics per individual: gender, age class,
diploma level and activity status. Their available
data were:
1. At municipality level: the cross table gender x age and the marginals of diploma level and activity status;
2. At district level: the cross tables gender x activity status, gender x diploma level, age x activity status and age x diploma level.
Note that a district contains several municipalities, but
each municipality is associated with only one district.
We assume the marginals of the tables are consistent. If not,
a preliminary step is necessary to re-scale the data,
which avoids shifting to probabilities. We chose this approach to
have the best chance of fitting the data well. When shifting to
probabilities, it is more difficult to adapt the distributions during
the iterations. Indeed, when working with counts,
if you create a young woman, you just subtract one from the
cells 'young' and 'women'. When working with
probabilities, creating a young woman requires recalculating all the
probabilities, because you still need fewer women and, proportionally, more men
than before. This is why we prefer to adapt the distributions
towards the one in which we are most confident.
The global idea of their method is to proceed in two steps.
First, they simulate the cross table of the
four variables per district. Then, this table is considered
as the seed of the IPF algorithm to
simulate the distributions per municipality. During
this second stage, the data concerning the municipality
are used as constraints. How to execute the second part
has been explained in the first section of this chapter.
The point here is to develop the process, per district,
to simulate the four-dimensional cross-table
with the available data. This is also done in two steps:
1. Generate age x gender x diploma level and age x gender x professional status;
2. Generate age x gender x diploma level x professional status.
For the first step, we explain only the
creation of the first cross-table, since the
reasoning for the second is similar. The idea is simply to proceed
proportionally so as to respect both available tables.
The pseudo-code below corresponds to the code
provided by Barthélemy and Toint (2013).
For clarity of the formula, we rename
gender (A), age (B) and diploma level (C). To create
the cross-table of these three variables,
we have at the district level the cross-tables
gender x diploma level (renamed AC) and
age x diploma level (renamed BC). The
cells of the three-dimensional
table are then defined for each gender $g$,
age $a$ and diploma level $d$ as follows:
$$ABC(g,a,d)=\frac{AC(g,d)}{margin(d)}BC(a,d)$$
The formula is intuitive. The fraction gives the proportion
of each gender inside the considered
category of diploma level. This proportion then splits
the number of persons having characteristics $a$ and $d$
between the categories of $g$.
For example, in the specific case of defining
(Male, Young, Academics), we have:
$$ABC(Male, Young, Aca)=\frac{AC(Male,Aca)}{\#Aca}BC(Young,Aca)$$
Suppose we have 50 young academics out of
150 academics (90 female and 60 male). We would have:
$$ABC(Male, Young, Aca)=\frac{60}{150}*50=20$$
$$ABC(Female, Young, Aca)=\frac{90}{150}*50=30$$
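The worked example can be verified in a few lines of R (the object names are ours; the numbers are those of the example):

```r
# District level inputs: gender x diploma (AC) and age x diploma (BC)
AC <- matrix(c(60, 90), nrow = 2,
             dimnames = list(c("Male", "Female"), "Aca"))
BC <- matrix(50, nrow = 1, dimnames = list("Young", "Aca"))
margin_Aca <- sum(AC[, "Aca"]) # 150 academics in total

# ABC(g, a, d) = AC(g, d) / margin(d) * BC(a, d)
AC["Male", "Aca"]   / margin_Aca * BC["Young", "Aca"] # 20
AC["Female", "Aca"] / margin_Aca * BC["Young", "Aca"] # 30
```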
Thus, the tables age x gender x diploma level
and age x gender x professional status are simulated.
The seed for the IPF function can now be
established with the help of these two contingency tables.
The initial weights will be the distribution of the four
variables inside the whole district.
This seed is generated over several iterations. The initialisation
of the cross-table is simply an array with the right number of dimensions,
with "0" in impossible cells and "1" in potentially
non-empty cells. For example, individuals under
10 years old cannot hold a university diploma.
\pagebreak
With this initial point, an IPF can be performed to fit the two
previously determined three-dimensional tables. The result
is a table of the four variables per district.
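A minimal sketch of this idea with **mipfp** (illustrative dimensions and numbers, not those of the paper; the package is assumed to be installed): the seed carries structural zeros for impossible cells, and IPF then fits the previously built tables, here reduced to two two-way margins for brevity.

```r
library(mipfp) # provides Ipfp(); assumed installed

# Seed over gender x age x diploma: 1 = possible cell, 0 = impossible
seed <- array(1, dim = c(2, 2, 2),
              dimnames = list(c("m", "f"), c("child", "adult"),
                              c("none", "uni")))
seed[, "child", "uni"] <- 0 # children cannot hold a university diploma

# Consistent target tables (gender x age and gender x diploma)
gender_age     <- matrix(c(10, 12, 14, 14), nrow = 2)
gender_diploma <- matrix(c(16, 18, 8, 8), nrow = 2)

fit <- Ipfp(seed, target.list = list(c(1, 2), c(1, 3)),
            target.data = list(gender_age, gender_diploma))

# The fitted table respects both margins and keeps the structural zeros
round(apply(fit$x.hat, c(1, 2), sum), 2) # matches gender_age
fit$x.hat[, "child", "uni"]              # still 0
```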
The final step was explained in
the previous section: we have a contingency table
at the district level and the zone margins.
Note that many different
combinations of IPF runs can be imagined to perform a population
synthesis adapted to your own data.
## Chapter summary
In summary, spatial microsimulation can be used in situations where no sample data are available:
the techniques can be adapted to work with a synthetic population.
This chapter presented two methods for creating synthetic populations in R, the choice
between which should depend on the type of
input constraint data available. The first method assumed access to
global cross-tables and local marginals. The second
assumed aggregate data at two different levels.