# Population synthesis without microdata {#nomicrodata}
Sometimes no representative individual level
dataset is available as an input for population synthesis.
In this case, the methods described in the previous
chapters must be adapted accordingly.
The challenge
is still to generate spatial microdata
that fits all the constraint tables, but based on a purely synthetic
'seed' input cross-tabulated contingency table. Many
combinations of individual level data
could correspond to these distributions. Depending on the aim
of the spatial microsimulation model, simply finding one reasonable fit
can be sufficient.
\index{sample-free methods}
In other cases a fit based on *entropy maximisation* may be required.
This concept involves finding the population that is most likely to
represent the micro level populations (Bierlaire, 1991) [@Bierlaire]. \index{entropy maximisation}
This chapter demonstrates
two options for population synthesis when
real individual level data is unavailable:
- *Global cross-tables and local marginal distributions* (\@ref(CrossGlobLocalMarg))
explains a method for cases where the constraints consist of a cross-table
that is not spatially located, together with local marginal distributions.
- *Two level aggregated data* (\@ref(twoLevelData)) contains a procedure for
performing spatial microsimulation with data at different levels of aggregation,
for example one dataset for provinces and another for districts.
## Global cross-tables and local marginal distributions {#CrossGlobLocalMarg}
Assume we have a contingency table of constraint variables
for the entire study area, but not at the local level. This multi-dimensional
cross-table (the seed) could be the result of a
previous step, such as using IPF to
re-weight individual level data to fit the case-study area of interest.
If the marginal distributions for small areas are known,
we can use the **mipfp** package as previously shown. If, however,
the only information about the zones is the total population living
there, the function is usable only when the zone is treated as
a variable. In this specific case, with
no additional data, the only option is to
re-scale the global cross-table for each zone. Note that this
assumes that the correlations between the variables
are independent of the zone in question.
To illustrate, we will develop the SimpleWorld example
(which can be loaded from the book's data directory by entering
`source("code/SimpleWorld.R")`, or may already be loaded if you followed \@ref(data-prep)) with
adapted constraints. The aggregate level data available so far is:
```{r,echo=FALSE, message=FALSE}
source("code/SimpleWorld.R")
```
```{r}
# Cross-tabulation of individual level dataset
table(ind$age, ind$sex)
(total_pop <- rowSums(con_sex)) # total population of each zone
```
In this section, the local constraint will be the total number
of people in each zone (the last column of `consTot`). The global constraint
is a matrix in the form of the cross-table between age and sex, but
including the total population (33 people for SimpleWorld). The new
constraints could be:
```{r}
# Global Constraint possible for SimpleWorld
global_cons <- table(ind$age, ind$sex)
global_cons[1,] <- c(6,9)
global_cons[2,] <- c(7,11)
# Local Constraint for SimpleWorld
local_cons <- total_pop
```
When only the total population is known for each zone, the best way to
create a synthetic population is to simply re-scale the cross-table.
For each zone, a table proportional to the global one is created.
The results are stored in a three-dimensional array, whose first
dimension represents the zone. The first step is to initialise the
resulting array, filling it with zeros.
```{r}
# initialise result's array and its names
resNames <- list(1:nrow(cons), rownames(global_cons),
colnames(global_cons))
res <- array(0, dim=c(nrow(cons), dim(global_cons)),
dimnames=resNames)
```
Now the final weight table is calculated, simply by
taking the global matrix and re-scaling it to fit
the desired marginals. In this way we keep the global
proportions, but with the correct total per zone.
Note that this process is exactly equivalent to
running **mipfp** on the seed table with only the zone
marginals as constraints.
```{r}
# Re-scale the cross-table to fit the zone's constraints
for (zone in 1:length(total_pop)){ # loop over the zones
res[zone,,] <- global_cons * total_pop[zone] / sum(global_cons)
}
# Print the cross-table for zone 1
res[1,,]
```
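The equivalence with **mipfp** noted above can be checked directly. The sketch below treats the zone as an extra dimension of the seed and constrains only the zone totals; the variable layout is our own and the `mipfp` package is assumed to be installed.

```r
library(mipfp) # provides Ipfp(); assumed installed

# SimpleWorld inputs, as defined above
global_cons <- matrix(c(6, 7, 9, 11), nrow = 2)
total_pop <- c(12, 10, 11) # zone totals

# Seed: one copy of the global cross-table per zone (zone = dimension 1)
seed <- array(NA, dim = c(length(total_pop), dim(global_cons)))
for (zone in seq_along(total_pop)) seed[zone, , ] <- global_cons

# Single constraint: the zone totals, i.e. the margin over dimension 1
fit <- Ipfp(seed, target.list = list(1), target.data = list(total_pop))

# Each zone slice equals the global table re-scaled to the zone total
all.equal(fit$x.hat[1, , ], global_cons * total_pop[1] / sum(global_cons))
```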
We can verify that the total population
per zone is of the desired size. We can also
check the global table of age and sex.
This confirms that the weights now fit
all the available data.
```{r}
# Check the local constraints for each zone (should be TRUE)
for (zone in 1:length(total_pop)){
print( sum(round(res[zone,,])) == total_pop[zone] )
}
# Save the global final table
SimTot <- apply(res,c(2,3),sum)
# Check the global constraint (should be 0)
sum(SimTot - global_cons)
```
As with IPF, the fractional result needs to be integerised to create
spatial microdata. The `round()` function generally provides a reasonable approximation
in terms of fitting the constraints. The integerisation algorithms introduced earlier, such as
*truncate, replicate, sample* (TRS), can also be used, although they cannot be followed
exactly here because we want to fit the few constraints we have perfectly.
In our example, a satisfactory result is achieved using
the `round()` function, as shown in the code below.
```{r}
# Integerisation by simply using round
resRound <- round(res)
# Zero error achieved by rounding for global constraint
apply(resRound, c(2,3), sum) - global_cons
# Zero error achieved by rounding for local constraint
apply(resRound,c(1),sum) - local_cons
```
It is due to luck (and the small size of the SimpleWorld example)
that the `round` method works in this case: in most
cases there will be errors due to rounding.
If a zone had 4 individuals and three categories, for example,
the resulting weights could be
$(\frac{4}{3},\frac{4}{3},\frac{4}{3})$. Rounding then
gives $(1,1,1)$, leaving too few individuals in the
synthetic population (3 rather than 4). We can try the algorithms proposed
in \@ref(sintegerisation). However, as illustrated
by the following code chunks, these integerisation methods lead to
errors in relation to the constraints.
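The rounding shortfall in the $(\frac{4}{3},\frac{4}{3},\frac{4}{3})$ example takes two lines to reproduce:

```r
# Fractional weights for a zone of 4 people split over 3 categories
w <- rep(4 / 3, 3)
sum(w)        # 4 people before integerisation
sum(round(w)) # only 3 after rounding: one person is lost
```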
```{r}
# Integerisation with pp
res_pp <- int_pp(res)
apply(res_pp, c(2,3), sum) - global_cons
```
These errors are often very small: when
modelling a whole country, the relative error is
negligible. Note that this small error comes from the
random draw in the last stage of the algorithm.
Here, TRS performs better than PP, as explained
in Chapter 5.
```{r}
# Integerisation with trs
set.seed(17)
res_trs <- res_pp <- array(dim = dim(res))
# Apply trs (see code/functions.R to see how int_trs works)
res_trs[] <- int_trs(res)
# Print the errors
apply(res_trs, c(2,3), sum) - global_cons
```
If desired, we can adapt TRS to ensure it
fits the constraints at the end of the process.
To adapt the method to use TRS, we first
truncate^[Truncating means
rounding each weight down to the nearest integer below it. This implies
that the population is underestimated.] the data and identify the missing individuals, in
terms of the constraints.
```{r}
# Truncate
resTruncate <- floor(res)
# number of missing individuals
sum(total_pop) - sum(resTruncate)
```
This means that, in total, 4 individuals are missing
after truncation. We have to choose which categories
will be incremented. For this, basic TRS takes the decimal parts
of the weights (discarded by the truncation) and
makes a random draw from this distribution. It is at this step that
errors with respect to the constraints can be introduced. To obtain a better fit after
integerisation, we need to
observe in which category and in which zone
individuals have to be added.
```{r}
# Calculate the total simulated cross-table
# After truncate
SimTotTruncate <- apply(resTruncate,c(2,3),sum)
# Number of missing individuals per category
# After truncate
ToAdd <- global_cons - SimTotTruncate
ToAdd
# Number of missing individuals per zone
# After truncate
ToComplete <- local_cons - apply(resTruncate,c(1),sum)
ToComplete
```
We observe that one individual must be added per
category of age and sex. In terms of zones, one individual is
missing in zones one and three, whereas two persons have to
be added in zone 2.
The principle now is to add people to the incomplete zones and
categories. The cells to be incremented are always chosen as the ones
with the biggest decimal parts (whereas in standard TRS, these decimals act as probabilities). Note that we chose
to adapt `resTruncate` instead of defining
another table.
The code works as follows: as long as individuals are missing
from the matrix, we look at the next biggest decimal and add
an individual to the corresponding cell if someone is missing in that
category and that zone.
\pagebreak
```{r}
# Calculate the decimals left by the truncation
decimals <- res - resTruncate
# Adapt resTruncate to fit all constraints
while (sum(total_pop) - sum(resTruncate) > 0){
  # find the cell with the biggest decimal ([1, ] guards against ties)
  i <- which(decimals == max(decimals), arr.ind = TRUE)[1, , drop = FALSE]
  # mark this cell as already considered
  decimals[i] <- 0
  # if individuals are still missing in this zone
  if (ToComplete[i[1]] > 0){
    # and in this category
    if (ToAdd[i[2], i[3]] > 0){
      resTruncate[i] <- resTruncate[i] + 1
      ToComplete[i[1]] <- ToComplete[i[1]] - 1
      ToAdd[i[2], i[3]] <- ToAdd[i[2], i[3]] - 1
    }
  }
}
```
The new values in `resTruncate` follow all constraints.
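The largest-remainder logic of the loop above can be illustrated on a single vector of weights (a standalone sketch, separate from the book's arrays):

```r
# Largest-remainder integerisation of one vector of fractional weights
w <- c(1.6, 1.3, 1.1)            # sums to 4
intw <- floor(w)                 # (1, 1, 1): one individual missing
n_missing <- round(sum(w)) - sum(intw)
# add the missing individuals to the cells with the largest decimals
decimals <- w - intw
ord <- order(decimals, decreasing = TRUE)
intw[ord[seq_len(n_missing)]] <- intw[ord[seq_len(n_missing)]] + 1
intw                             # (2, 1, 1): the total now matches
```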
The adaptation of TRS could be avoided by using combinatorial
optimisation to integerise. Indeed, such a process could choose,
for each cell, whether to take the integer just below or just above the weight,
by optimising the fit to the constraints. We use TRS here because it is
faster and requires fewer lines of code [see @Lovelace2013-trs for more detail].
After the integerisation, the last step to
obtain the final individual dataset is
expansion. This stage is intuitive, since
we now have a table containing the number of
individuals in each category. Thus, we simply
need to replicate each combination of categories
the right number of times.
We first flatten the three-dimensional array;
then the final individual level dataset `ind_data`
is created.
```{r}
countData <- as.data.frame.table(resTruncate)
indices <- rep(1:nrow(countData), countData$Freq)
ind_data <- countData[indices,]
```
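As a standalone sanity check of the expansion step, the same `as.data.frame.table()` and `rep()` pattern applied to a toy count table gives one row per individual:

```r
# Toy count table: 2 age bands x 2 sexes, 6 individuals in total
counts <- matrix(c(2, 1, 0, 3), nrow = 2,
                 dimnames = list(age = c("young", "old"), sex = c("m", "f")))
countData <- as.data.frame.table(counts)
# Replicate each category combination Freq times
ind_data <- countData[rep(1:nrow(countData), countData$Freq), ]
nrow(ind_data) # 6: one row per individual
```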
## Two level aggregated data {#twoLevelData}
We present here how to find a possible distribution per zone when
only aggregated data are available, but at two different levels of aggregation.
For example, we may have some data for municipalities and other data for
districts. A first proposal would be to use a genetic algorithm that
minimises the distance between the constraints and the simulation. This
can give very good solutions, but requires a high level of understanding of
optimisation and is, for the moment, rare in the literature. For classical
problems a simpler method is available, the basis of which is
explained here. The solution proposed by Barthélemy and Toint (2013), and used in this book,
is to generate a 'seed' before executing IPF.
Their paper demonstrates the simulation of a population at the municipality
level with four characteristics per individual: gender, age class,
diploma level and activity status. Their available
data were:
1. At municipality level: the cross table gender x age and the marginals of diploma level and activity status;
2. At district level: the cross tables gender x activity status, gender x diploma level, age x activity status and age x diploma level.
Note that a district contains several municipalities, but
each municipality is associated with only one district.
We assume the marginals of the tables are consistent. If not,
a preliminary step is necessary to re-scale the data,
which avoids shifting to probabilities. We chose this approach to
have the best chance of fitting the data well. When shifting to
probabilities, it is more difficult to adapt the distributions during
the iterations. Indeed, when working with counts,
if you create a young woman, you just subtract one from the
cells 'young' and 'women'. When working with
probabilities, creating a young woman requires recalculating all the
probabilities, because you still need fewer women and, proportionally, more men
than before. This is why we prefer to adapt the distributions
towards the one in which we are most confident.
The global idea of their method is to proceed in two steps.
First, they simulate the cross table of the
four variables per district. Then, this table is considered
as the seed of the IPF algorithm to
simulate the distributions per municipality. During
this second stage, the data concerning the municipality
are used as constraints. How to execute the second part
has been explained in the first section of this chapter.
The point here is to develop the process, per district,
to simulate the four-dimensional cross-table
with the available data. This is also done in two steps:
1. Generate age x gender x diploma level and age x gender x professional status;
2. Generate age x gender x diploma level x professional status.
For the first step, we explain only the
creation of the first cross-table, since the
reasoning for the second is similar. The idea is simply to proceed
proportionally so as to respect both available tables.
The pseudo-code below corresponds to the code
provided by Barthélemy and Toint (2013).
For clarity of the formula, we rename
gender (A), age (B) and diploma level (C). To create
the cross-table of these three variables,
we have at the district level the cross-tables
gender x diploma level (renamed AC) and
age x diploma level (renamed BC). The
cells of the three-dimensional
table are then defined for each gender $g$,
age $a$ and diploma level $d$ as follows:
$$ABC(g,a,d)=\frac{AC(g,d)}{margin(d)}BC(a,d)$$
The formula is intuitive. The fraction gives the proportion
of each gender inside the considered
category of diploma level. This proportion then splits
the number of persons having characteristics $a$ and $d$
between the categories of $g$.
For example, in the specific case of defining
(Male, Young, Academics), we have:
$$ABC(Male, Young, Aca)=\frac{AC(Male,Aca)}{\#Aca}BC(Young,Aca)$$
Suppose we have 50 young academics out of
150 academics (90 female and 60 male). We would have:
$$ABC(Male, Young, Aca)=\frac{60}{150}*50=20$$
$$ABC(Female, Young, Aca)=\frac{90}{150}*50=30$$
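The worked example can be verified in a few lines of R (the object names are ours; the numbers are those of the example):

```r
# District level inputs: gender x diploma (AC) and age x diploma (BC)
AC <- matrix(c(60, 90), nrow = 2,
             dimnames = list(c("Male", "Female"), "Aca"))
BC <- matrix(50, nrow = 1, dimnames = list("Young", "Aca"))
margin_Aca <- sum(AC[, "Aca"]) # 150 academics in total

# ABC(g, a, d) = AC(g, d) / margin(d) * BC(a, d)
AC["Male", "Aca"]   / margin_Aca * BC["Young", "Aca"] # 20
AC["Female", "Aca"] / margin_Aca * BC["Young", "Aca"] # 30
```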
Thus, the tables age x gender x diploma level
and age x gender x professional status are simulated.
The seed for the IPF function can now be
established with the help of these two contingency tables.
The initial weights will be the distribution of the four
variables inside the whole district.
This seed is generated over several iterations. The initialisation
of the cross-table is simply an array with the right number of dimensions,
with "0" in impossible cells and "1" in potentially
non-empty cells. For example, individuals under
10 years old cannot hold a university diploma.
\pagebreak
With this initial point, an IPF can be performed to fit the two
previously determined three-dimensional tables. The result
is a table of the four variables per district.
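A minimal sketch of this idea with **mipfp** (illustrative dimensions and numbers, not those of the paper; the package is assumed to be installed): the seed carries structural zeros for impossible cells, and IPF then fits the previously built tables, here reduced to two two-way margins for brevity.

```r
library(mipfp) # provides Ipfp(); assumed installed

# Seed over gender x age x diploma: 1 = possible cell, 0 = impossible
seed <- array(1, dim = c(2, 2, 2),
              dimnames = list(c("m", "f"), c("child", "adult"),
                              c("none", "uni")))
seed[, "child", "uni"] <- 0 # children cannot hold a university diploma

# Consistent target tables (gender x age and gender x diploma)
gender_age     <- matrix(c(10, 12, 14, 14), nrow = 2)
gender_diploma <- matrix(c(16, 18, 8, 8), nrow = 2)

fit <- Ipfp(seed, target.list = list(c(1, 2), c(1, 3)),
            target.data = list(gender_age, gender_diploma))

# The fitted table respects both margins and keeps the structural zeros
round(apply(fit$x.hat, c(1, 2), sum), 2) # matches gender_age
fit$x.hat[, "child", "uni"]              # still 0
```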
The final step was explained in
the previous section: we have a contingency table
at the district level and the zone margins.
Note that many different
combinations of IPF runs can be imagined to perform a population
synthesis adapted to your own data.
## Chapter summary
In summary, spatial microsimulation can be used in situations where no sample data are available:
the techniques can be adapted to work with a synthetic population.
This chapter presented two methods for creating synthetic populations in R, the choice
between which should depend on the type of
input constraint data available. The first method assumed access to
global cross-tables and local marginals. The second
assumed aggregate data at two different levels.