-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathwelfare_data.Rmd
548 lines (439 loc) · 23 KB
/
welfare_data.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
---
editor_options:
markdown:
wrap: 72
---
# Welfare data {#welfare-data}
Welfare data refers to any data file that contains the welfare vector
that underlies the poverty and inequality measures in each
country/year/welfare type. PIP makes use of four different data kinds to
store the welfare vectors: microdata, bin data, group data, and
synthetic data. Each of these is collected, formatted, and stored in a
particular way.
## From Raw to GMD {#welfare-origin-of-data}
### Microdata
The Global Monitoring Database (GMD) is the World Bank's repository of
multitopic income and expenditure household surveys used to monitor
global poverty and shared prosperity. The household survey data are
typically collected by national statistical offices in each country, and
then compiled, processed, and harmonized. The process of harmonization
is done at the regional level by the regional statistical development
teams. This is to ensure that the harmonization is done with best
country knowledge from the country TTLs as well as to ensure consistency
of country data overtime. Selected variables have been harmonized to the
extent possible such that levels and trends in poverty and other key
sociodemographic attributes can be reasonably compared across and within
countries over time.
The regional harmonization process also follows the global harmonization
guidelines for most of the variables in GMD. PovcalNet also contributed
historical data from before 1990, and recent survey data from Luxemburg
Income Studies (LIS).
The global update process is coordinated by the Data for Goals (D4G)
team and supported by the six regional statistics teams in the Poverty
and Equity Global Practice.
The original data files used for the regional harmonization and GMD
harmonization should be cataloged and stored in the regional shared
drive and the central Microdata Library catalog
(<http://microdatalib/>). We assume that the depositing of the original
microdata in the Microdata Library is a collective responsibility shared
by all World Bank staff working in all the World Bank's regions,
irrespective of global practice, since the dialogue and data acquisition
through the NSO and line ministries often take place in a decentralized
manner.
### Group Data
Some countries have very restrictive and confidential data sharing
conditions. In such cases, country economist, in collaboration with the
regional team and PIP team, would send a request to the country NSO for
the group or "collapsed" data for the purpose of monitoring
international poverty (SDG 1.1.1). Historically, the group data can be
exported from national statistics reports with either data format of
type 2 or type 5. In recent years, country economists would request the
group data from NSO directly. For example, we request the group data -
ventile (20) for China by urban and rural, type 2. For other countries,
like high income countries, such as UEA (Country code ARE), NSO directly
reach out to us in order to update SDG 1.1.1, in this case the group
data is shared to us and we validate the data.
### Bin Data
Some of the countries that are not available in DataLibWeb can be found
in the repository of the [LIS Cross-National Data
Center](https://www.lisdatacenter.org/) (hereafter, LIS). Currently,
PovcalNet uses LIS data for 8 high-income economies: AUS, CAN, DEU, ISR,
JPN, KOR, TWN & USA, plus the Pre-EUSILC years (generally before 2002)
of European Economies.
LIS datasets cannot be downloaded in full; however, they provide a
remote-execution system,
[LISSY,](http://www.lisdatacenter.org/data-access/lissy/) that allows us
interact with their microdata without having access to the individual
records. We have developed a set of Stata do-files to interact with
LISSY and aggregate the welfare distribution of our countries of
interest to 400 bins. Then, these data is organized locally and shared
with the Poverty GP to be included in DataLibWeb as a collection
independent from GPWG.
#### The LIS_data repository {.unnumbered}
In order to work with the LIS data you need to clone the repository
[PovcalNet-Team/LIS_data](https://github.com/PovcalNet-Team/LIS_data).
You will find in there three folders, *00.LIS_output*, *01.programs*,
and *02.data*.
#### Interacting with LISSY {.unnumbered}
**Opening an account in LISSY:**
To interact with the LIS data you need to first register
[**here**](http://www.lisdatacenter.org/lis-luli-frontend-webapp/app/request-account-identification),
by first completing the [LIS microdata User Registration
Form](http://www.lisdatacenter.org/wp-content/uploads/our-microdata-user-registration-form.pdf),
and then submitting it through the same website using your
**institutional e-mail** account. Within a couple of days, you will
receive an e-mail from LIS containing your username and password.
::: {.rmdbox .rmdnote}
*You do not get to choose your own username or password. LISdatacenter
creates both for you and those won't change in time. Make sure to save
that e-mail and record that information for your future log-ins. Also,
know that LISSY passwords expire each year on December 31st. While your
password won't change, it must be
**renewed**[here](http://www.lisdatacenter.org/lis-luli-frontend-webapp/app/request-renew-identification)
after January 1st.*
:::
**Interacting with LISSY:**
To get acquainted with LISSY's interface, coding structure, database
naming and variables available, and learn how to compute estimates
within LISSY we highly recommend taking some time to review the
[**tutorials**](https://www.lisdatacenter.org/resources/) and
[**self-teaching
materials**](https://www.lisdatacenter.org/resources/self-teaching/).
However, in order to update the 400 bins twice a year, Stata codes have
been previously written, so you simply need to follow the 5 steps in the
next section.
#### Getting the 400 bins from LISSY {.unnumbered}
**1. log in** Go to [LIS main page](https://www.lisdatacenter.org/),
scroll down and click on the lock icon
![](images/LIS_login.JPG){width="113"}
**2. Provide info** Feed the three drop-down menus on top of the
platform with the following information:
- Project: **LIS**
- Package: **Stata**
- Subject: (*Choose a name* Ex: "**Bins \#1 - Dec 2020**")
::: {.rmdbox .rmdimportant}
The LISSY platform cannot run the code for ALL surveys available at
once. If you attempt to do so, your project will stop and you will
receive an e-mail containing the text:
#####################################################
Your job has been killed and will be not executed
#####################################################
To avoid this, we need to run the code in groups of 5 to 6 countries,
depending on the amount of years in each of them. Currently, LIS has
data for [52 countries]{.ul} (26 of them are in the EUSILC project, and
26 do not) which usually take approximately [10 rounds]{.ul} of this
process.
:::
**3. Add do-file** Copy and paste the entire content of
`01.LIS_400bins.do` file into the the main large command window and
update the locals in lines 23-24 with the LIS 2-letter acronyms of the
group of 5-6 countries in each round.
```{stata, eval = FALSE}
local silc "at be cz"
local nosilc "au br"
```
::: {.rmdbox .rmdtip}
Remember to update the subject with each round (Ex: "Bins \#2 - Dec
2020") so you keep track of the number of output files. Be careful not
to leave out [or repeat] any country in the process.
:::
**4 Submit**. Click on the green arrow
![](images/LIS_submit_job.JPG){width="18" height="15"}icon to submit
your project. You will get an e-mail within some minutes with your
output. If the system kills your project, your group of countries was
probably too large. Remove one country and try again.
**5. Retrieve results** Copy the entire text in the output e-mail you
receive for each round, open your notepad and paste. Save each round in
the ***\\00.LIS_output*** folder. Save each text file with the name
**LISSY_Dec202\@\_\#.txt**, [where \@ is the year and \# the round].
Consistency with this naming format is important for the next step
(`02.LIS_organize_output.do` file)
#### From Text file to datalibweb structure {.unnumbered}
We now need to convert the the text files generated by the LISSY system
to actual data suitable for datalibweb. This structure is suggested by
the [International Household Survey Network](https://ihsn.org/) (IHSN).
Once the data is saved in folder `00.LIS_output` you need to execute the
file **`02.LIS_organize_output.do`**. This file created to be executed
in just one go. However, it could be ran in sections taking advantage of
the different
[frames](https://www.stata.com/new-in-stata/multiple-datasets-in-memory/)
along the code.
Before you execute this code, you need to ensure a few things,
**1. Get `rcall` working in your computer**
The processing of the text files is not done anymore on Stata but in R.
To avoid changing systems, we need to execute R code directly from
Stata. In order to do this, you need to make sure to have install R in
your computer and also the Stata command `rcall`. The do-file
`02.LIS_organize_output.do` will check if you have it installed and will
install it for you in case it is not. However, you can run the lines
below to make sure everything is working fine. Also, you can take a look
at the help file of `rcall` to get familiar with it.
```{stata, eval = FALSE}
cap which rcall
if (_rc) {
cap which github
if (_rc) {
net install github, from("https://haghish.github.io/github/")
}
github install haghish/rcall, stable
}
```
**2. Personal Drive**
Make sure to add your UPI to the appropriate sections it appears by
typing `` disp lower("`c(username)'") `` , following the example below,
```{stata, eval = FALSE}
if (lower("`c(username)'") == "wbxxxxx") {
local dir "c:/Users/wbxxxxx/OneDrive - WBG/WorldBank/DECDG/PovcalNet Team/LIS_data"
}
```
**3. Directives of the code**
This do-file works like an ado-file in the sense that the output depends
on the value of some local macros,
```{stata, eval = FALSE}
global update_surveynames = 1 // 1 to update survey names.
global replace = 0 // 1 to replace data in memory even if it has not changed
global p_drive_output_dir = 0 // 1 to use default Vintage_control folder
```
If local `update_surveynames` is set to 1, the code will load the sheet
`LIS_survname` from the the file `02.data/_aux/LIS datasets.xlsx` and
updated the file `02.data/_aux/LIS_survname.dta`. If `replace` is set to
1, the code will replace any output with the same name. Otherwise, it
will create a new vintage version *if the two files are different*. If
they are not different, the code will do nothing. local
`p_drive_output_dir` is deprecated, so you must leave it as **0**.
**4. Pattern of the text files**
When the text files with the information from LIS are stored in
`00.LIS_output`, they should be stored in a systematic way so that they
could be loaded and processed at the same time. This can be done by
specifying in a matching regular expression in local `pattern`. For
instance, all the files downloaded in December, 2020 could by loaded and
processed using the directive, `local pattern = "LISSY_Dec2020.*txt"`.
**5. Output**
When the do-file is concluded, it saves the file
`02.data/create_dta_status.dta` with the status of all the surveys
processed.
#### Compare new LIS data to Datalibweb inventory {.unnumbered}
To identify what data is new and what data has changed with respect to
the one available in datalibweb, you need to execute do-file
`03.LIS_compare_dlw.do`. Again, this do-file is intended to be executed
in one run, but you can do it in parts taking advantage of the different
frames. At the end of the execution the file
`02.data/comparison_results.dta` is created. This file contains three
important variables `wf`, `wt`, and `gn`, which correspond to the ration
of welfare means, weight means, and Gini coefficient between the data in
datalibweb and the data in the folder,
`p:/01.PovcalNet/03.QA/06.LIS/03.Vintage_control`.
You should only send to the Poverty GP those surveys for which at least
one of these three variables is different to 1.
#### The Excel file `LIS datasets.xlsx` {.unnumbered}
With each LIS data update performed, we must first identify from LIS the
new surveys (countries and/or years) they had recently added. LIS send
users e-mails informing about new datasets added, and also releases
[newsletters](https://www.lisdatacenter.org/news-and-events/highlights/)
with this information.
Inside the *02.data* folder of your LIS_data GIT repository you will
find a *\_aux* sub folder, and the `LIS datasets.xlsx` file placed in
there. We must **manually** update the tab **LIS_survname tab** adding
new rows to the sheet. All necessary information to fill up this
metadata (household size, currency, etc.) can be found in
[METIS](http://www.lisdatacenter.org/frontend#/home).
::: {.rmdbox .rmdtip}
**ACRONYMS:** The column survey_acronym is created by us. If you come
across a new survey for which an acronym has not been previously
established, the rules applied in the past by the team were the
following:
- Acronyms are created based on the ENGLISH name of the survey. (Ex:
*German Transfer Survey (Germany)* is "GTS", followed by the suffix
-LIS; thus GTS-LIS.
- For the surveys that were Microcensus, we created the acronym "MC",
and for Denmark's Law Model, "LM".
- All acronyms are created in capital letters.
:::
Finally, while the survey names in METIS are in English, some of the
acronyms in parenthesis are still in the original language. In those
cases we translated them to English. For instance, the survey name
"*Household Budget Survey (BdF) (France)*" from METIS was changed to
"*Household Budget Survey (**HBS**) (France)*" in the column
**surveyname** of the excel.
#### Prepare data for the Poverty GP {.unnumbered}
Finally, the do-file `04.Append_new_LIS_bases.do` prepares the data to
be shared with the Poverty GP. Note that this do-file ONLY appends the
400 bins data of surveys that are **new** and those where **welfare
changed**, which are identified in the previous step
`/comparison_results.dta` as those `gn != 1`.
Before running the code, make sure to **change the output file name** to
the **date** of your update (last one saved was
"LIS_bins_Dec_21_2020.dta"). The output is saved in
`P:\01.PovcalNet\03.QA\06.LIS\04.Share_with_GP.`
Finally, quickly prepare a short .dta file importing the metadata
already created in the LIS_survname tab from the Excel, keeping ONLY the
surveys of the append output you just run and send both files to Minh
Cong Nguyen \<mnguyen3\@worldbank.org\> from the Poverty GP.
### Synthetic Data
Synthetic (or imputed) data is used in particular instances where
household surveys do not collect income or expenditure information; or
collect partial expenditures (i.e. for a small list of items [and rotate
across different households]{style="color:red"}). In cases of that
nature, country teams incur in efforts to impute the income or
consumption welfare vectors. This exercise is based on the imputation of
welfare from one survey to another. The imputation modeling is between
the welfare, and household & individual characteristics from a previous
household survey, where data on both is available. This exercise
consists of simulating welfare information for each household in the
survey several times based on different simulated coefficients from the
imputation models. The resulting data is stacked together from each
simulated set. Final data should follow the same format as the regular
GPWG module with the exception of the variable "sim" (which indicates
the number of simulations in the stacked database).
FGT indicators are estimated in the same way as any other data; however,
for sensitive distributional indicators, calculations are run separately
for each simulation, then averaged across results to get the final
indicator. This data will get a value of 1 in the "imputed" variable of
the price framework database.
## From GMD to PIP ({pipdp} package) {#from-gmd-to-pip}
The next step in the preperation of survey data for PIP is to leverage
the [`{pipdp}`](https://github.com/PIP-Technical-Team/pipdp) package.
This package retrives survey data from DLW and standardizes it into the
format used by the PIP ingestion pipelines ([Poverty Calculator
Pipeline](#pcpipeline) and [Table Maker Pipeline](#tmpipelilne)).
### Requirements
`{pipdp}` depends on the following Stata packages.
``` {.stata}
ssc install moremata
ssc install missings
```
You will also need to have the World Bank `{datalibweb}` module and the
PovcalNet internal `{pcn}` command installed.
### Installation
``` {.powershell}
# Clone repo
git clone "https://github.com/PIP-Technical-Team/pipdp"
# Edit profile.do
notepad.exe C:\ado\personal\profile.do
# Add adopath ++ "<mypath>\pipdp"
```
### Remote server
It is recommended to both use `{pipdp}` on the PovcalNet remote server
(WBGMSDDG001). The phyiscal location of this server is much closer to
the data storage so queries to Datalibweb, as well as read/write
operations to the PIP shared network drive, will be much faster.
The only issue with using the server is that Datalibweb will require
login credentials for each new Stata session. To avoid this it is highly
recommended that users apply for direct access to Datalibweb. This will
grant access to the Datalibweb file storage and enable the possiblity of
using the `files` option in all `datalibweb` queries.
In order to use the `files` option you will also need to make some minor
adjustments to your Datalibweb setup files. In particular the global
`root` variable in `C:/ado/personal/datalibweb/GMD.do` needs to be
specified correctly. This can be achived by copying `datalibweb.ado` and
`GMD.do` in the `_aux` directory to their respective locations.
Since WBGMSDDG001 is a shared server and the DLW settings thus can be
reset, you will need to make sure that these settings are correct every
time you use `pipdp`. The modified setup files will *only* change the
behavior of Datalibweb for users that have direct access to the file
storage. It will not affect other users. You could either check and copy
the files manually or run the following from the command line.
```{cmd, eval = FALSE}
copy _aux/datalibweb.ado C:/ado/plus/d/datalibweb.ado
copy _aux/GMD.do C:/ado/personal/datalibweb/GMD.do
```
For details on how to connect to WBGMSDDG001 see the [Remote server
connection](https://povcalnet-team.github.io/Povcalnet_internal_guidelines/folder-structures.html#server)
section in the PovcalNet Internal Guidelines and Protocols.
### Usage
`pipdp` has two main commands, `pipdp group` to copy and prepare grouped
data files from the PocvalNet shared network drive and `pipdp micro` to
download and prepare micro datasets from Datalibweb. For examples on how
to use these commands see the [Usage
section](https://github.com/PIP-Technical-Team/pipdp#usage) in the
`{pipdp}`README.
### Preparing survey data for Poverty Calculator Pipeline
Follow these steps when preparing survey data for the [Poverty
Calculator Pipeline](#pcpipeline)
1. Make sure the Price Framework file is up to date in the directory
you are using.
```{r, eval = FALSE}
# Update PFW file
PIP_DATA_DIR <- pipload::create_globals(Sys.getenv("PIP_ROOT_DIR"))$PIP_DATA_DIR
# pipaux::pip_country_list('update', maindir = PIP_DATA_DIR)
pipaux::pip_pfw('update', maindir = PIP_DATA_DIR)
```
2. Make sure the Datalibweb repository file is up to date in directory
you are using. Note that if you are running this code on the
PovcalNet remote server you will be required to log in to DLW,
regardless if you have direct access or not. This is because the
underlying `datalibweb, type(GMD) repo(create dlwrepo)` command
always sends queries over web. If you want you to avoid the login
requirement you can conduct this specific step from your local
machine.
```{stata, eval = FALSE}
// Update DLW repo
pipdp dlw maindir("$PIP_DATA_DIR>")
```
3. Make sure DLW setup files are up-to-date (needed for `files`
option).
```{bat, eval = FALSE}
::Copy setup files
copy _aux/datalibweb.ado C:/ado/plus/d/datalibweb.ado
copy _aux/GMD.do C:/ado/personal/datalibweb/GMD.do
```
4. Update surveys
```{stata, eval = FALSE}
// Update grouped data surveys (from PCN-drive)
pipdp group, countries(all) maindir("$PIP_DATA_DIR>")
// Download new surveys from DLW
pipdp micro, countries(all) files new maindir("$PIP_DATA_DIR>")
```
5. Update the PIP inventory
```{r, eval = FALSE}
# Update PIP inventory
PIP_DATA_DIR <- pipload::create_globals(Sys.getenv("PIP_ROOT_DIR"))$PIP_DATA_DIR
pipload::pip_update_inventory(maindir = PIP_DATA_DIR)
```
## Survey ID nomenclature {#survey-id}
All household surveys in the PIP repository are stored following the
naming convention of the [International Household Survey Network
(IHSN)](http://ihsn.org/) for [archiving and managing
data](http://www.ihsn.org/archiving). This structure can be generalized
as follows:
CCC_YYYY_SSSS_ vNN_M_vNN_A_TYPE_MODULE
where,
- `CCC` refers to 3 letter ISO country code
- `YYYY` refers to survey year when data collection started
- `SSSS` refers to survey acronym (e.g., LSMS, CWIQ, HBS, etc.)
- `vNN` is the version vintage of the raw/master data if followed by
an `M` or of the alternative/adapted data if followed by an `A`. The
first version will always be v01, and when a newer version is
available it should be named sequentially (e.g. v02, v03, etc.).
Note that when a new version is available, the previous ones must
still be kept.
- `TYPE` refers to the collection name. In the case of PIP data, the
type is precisely, **PIP**, but be aware that there are several
other types in the datalibweb collection. For instance, the Global
Monitoring Database uses **GMD**; the South Asia region uses
**SARMD**; or the LAC region uses **SEDLAC**.
- `MODULE` refers to the module of the collection. This part of the
survey ID is only available at the *file level*, not at the folder
(i.e, survey) level. Since the folder structure is created at the
survey level, there is no place in the survey ID to include the
module of the survey. However, within a single survey, you may find
different modules, which are specified in the name of the file. In
the case of PIP, the module of the survey is divided in two: the PIP
tool and the GMD module. The *PIP tool* could be *PC*, *TB,* or
*SOL* (forthcoming), which stand for one of the PIP systems, Poverty
Calculator, Table Baker (or Maker), and Statistics OnLine. the *GMD
module*, refers to the original module in the GMD collection, such
as module *ALL*, or *GPWG, HIST,* or *BIN.*
For example, the most recent version of the harmonized Pakistani
Household Survey of 2015, she would refer to the survey ID
`PAK_2015_PSLM_v01_M_v02_A_PIP`. In this case, PSLM refers to the
acronym of the Pakistan Social and Living Standards Measurement Survey.
v01_M means that the version of the raw data has not changed since it
was released and v02_A means that the most recent version of the
alternative version is 02.2
## Survey CACHE ID nomenclature {#survey-id-cache}
[NOTE:Andres, add explanation here]{style="color:red"}
::: {.light-purple}
Testing tachyons colors
:::
This is a test [Royal blue]{style="color:royalblue"} color in line