Replies: 22 comments 7 replies
-
Could you please clarify how to calculate prevalence in this case? Also, I'd like to know how to format the output file; the example you provided is a bit confusing.
-
Prevalence is the proportion of the sample with a specific characteristic. If the characteristic is defined by a dummy variable, as is the case here, then the mean of the dummy variable gives the prevalence. Regarding the format, I provided a CSV example that I made in Stata. This was provided with the expectation that you would try to reproduce it using more efficient code. A separate file of column descriptions was provided to assist in this. Please let me know if there is anything specific that remains confusing. Equally, if you think a different format would be useful, I am open to suggestions.
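For concreteness, here is a minimal sketch of that rule (prevalence as the mean of a 0/1 dummy), written in Python rather than the project's Stata/R; the `prevalence` helper and the example values are illustrative only:

```python
def prevalence(dummies):
    """Mean of a 0/1 dummy variable, skipping missing (None) observations."""
    observed = [d for d in dummies if d is not None]
    return sum(observed) / len(observed)

# e.g. 3 of 4 observed individuals have the characteristic; one is missing
emp_dummy = [1, 0, 1, None, 1]
print(prevalence(emp_dummy))  # 0.75
```

Note that missing observations are excluded from both the numerator and the denominator, which matches the mean of the dummy over non-missing cases.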
-
You say that
does it mean automatically
I.e., you transform a continuous variable into a boolean one (sort of). I'm familiar with this kind of transformation; I just want to be sure I understand everything correctly.
-
Almost. The dummy could be missing if `dhm` is missing. Stata treats missing as infinity, so `dummy=0 if dhm>24 & dhm!=.` would be needed. This may not be the case with other software.
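A sketch of the same point outside Stata, using Python with `None` standing in for a missing value (illustrative only, not the project's code): the dummy is set only when `dhm` is observed, so a missing input gives a missing outcome rather than a spurious 0.

```python
def ghq_case(dhm):
    """1 if dhm <= 24, 0 if dhm > 24, missing (None) if dhm is missing.

    In Stata, missing sorts as +infinity, so a bare "dhm > 24" would be
    true for missing values; guarding on missingness first avoids that.
    """
    if dhm is None:
        return None          # missing input -> missing outcome
    return 1 if dhm <= 24 else 0

print([ghq_case(v) for v in [10, 30, None]])  # [1, 0, None]
```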
-
Fair enough, I keep forgetting about missing values. So, the approach is to interpret all missing values as 0/null/false etc.? One more thing: you mentioned employed as a separate group, but for them the prevalence of employment makes no sense. What am I missing here?
-
The approach is to treat any variable determined by a missing variable as also being missing. Zero and false are not usually missing values. Null usually is. You are correct that prevalence of employment will not vary in the employed group. Likewise, there are poverty groups where the prevalence of poverty will not vary. However, it was more efficient to run a loop covering all outcomes for all groups. This also provides a check that things are working as they should.
-
Fair enough. I went through the data columns now to load only what is needed. It looks like
-
Where is `out_poverty` missing?
-
Both:

```r
> colnames(df)
 [1] "V1" "run" "time" "id_household"
 [5] "id_benefitunit" "id_female" "id_male" "id_person"
 [9] "id_father" "id_mother" "id_original" "id_partner"
[13] "hh_dwt" "hh_size" "atriskofpoverty" "dhh_owned"
[17] "dhhtp_c4" "disposableincomemonthly" "equivaliseddisposableincomeyearl" "n_children_0"
[21] "n_children_1" "n_children_10" "n_children_11" "n_children_12"
[25] "n_children_13" "n_children_14" "n_children_15" "n_children_16"
[29] "n_children_17" "n_children_2" "n_children_3" "n_children_4"
[33] "n_children_5" "n_children_6" "n_children_7" "n_children_8"
[37] "n_children_9" "occupancy" "region" "bu_size"
[41] "ydses_c5" "adultchildflag" "dag" "dcpagdf"
[45] "dcpen" "dcpex" "dcpst" "dcpyy"
[49] "ded" "deh_c3" "dehf_c3" "dehm_c3"
[53] "dehsp_c3" "der" "dgn" "dhe"
[57] "dhesp" "dhm" "dlltsd" "inversemillsratiomaxfemale"
[61] "inversemillsratiomaxmale" "inversemillsratiominfemale" "inversemillsratiominmale" "laboursupplyweekly"
[65] "les_c4" "les_c7_covid" "lesdf_c4" "lessp_c4"
[69] "potentialearnings" "sindex" "sindexnormalised" "weight"
[73] "yearlyequivalisedconsumption" "ynbcpdf_dv" "yplgrs_dv" "ypnbihs_dv"
[77] "ypncp" "ypnoab" "yptciihs_dv" "scaling_factor"
```
-
I see. These outcomes must be constructed based on the rules defined at the top of this page. I have used `dummy=` for three outcomes to show they are dummy variables. Dummy is not the variable name; these would be `out_ghqcase`, `out_emp`, and `out_poverty`. There is an additional suffix on the variable name to indicate whether the outcome relates to the baseline or reform arm of the scenario. The other outcomes, those that are not dummy variables, have the variable names `out_ghq`, `out_emphrs`, and `out_income`, again with the suffix attached. Rules to construct groups were also provided, but not the variable names for the output. All group variables have the prefix `grp` and should be self-explanatory.
-
I get it, I'll fix the names tomorrow. Meanwhile, you can take a look at the first version here: 28c8ef3. The path to the data file within the project directory should be It's a bit messy, no tests, no different scenarios at the moment, but it can produce most of the results and it's performant enough (takes about two minutes to get it done).
-
That seems a good start. More comments in the code would be helpful, as I am less familiar with R.
-
Noted; that will help everyone, since my R skills are rusty as well.
-
@dkopasker I finally have a version that calculates the metrics for the whole population, modifying it for other cases should not be that difficult. However, I never asked you about the definition of rank.
-
Rank numbers the observations for each outcome from highest to lowest. Equal observations are assigned the average rank. For example, `rank_ghq_base` orders observations on `out_ghq_base` from 1 to N, where N is the number of runs. Ranking must be done after collapsing to one observation per run per year. The rank will be used to define percentiles relevant to a 95% confidence interval.
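To pin down the tie-handling rule, here is an illustrative Python sketch (not the project's R code): observations are ordered from highest to lowest, and tied values receive the average of the ranks they would otherwise occupy.

```python
def average_ranks(values):
    """Rank values from highest (rank 1) to lowest; ties get the mean rank."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        # find the run of tied values starting at pos
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = ((pos + 1) + (end + 1)) / 2  # average of ranks pos+1 .. end+1
        for j in order[pos:end + 1]:
            ranks[j] = mean_rank
        pos = end + 1
    return ranks

# the two 3.0s tie for ranks 1 and 2, so each gets 1.5
print(average_ranks([3.0, 1.0, 3.0, 2.0]))  # [1.5, 4.0, 1.5, 3.0]
```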
-
Good to know, I'll add this to the code. See 62f28ba for the latest version, no ranks and no effects at the moment, but all groups are now included. This takes about 90 seconds to get done on my machine with the
-
A 115GB file of output from 1,000 runs of 1 arm of the COVID scenario is now stored on the T drive. Compressing the file was predicted to take 11 hours, but copying took only 20 minutes. We can see how long this file takes to run on our laptops, but getting R installed on the "machine up the stairs" is likely to be beneficial.
-
The latest commit now provides rank support. You can run this script on your machine; 16GB of RAM is enough for 50 runs and 1 scenario. I think even 8GB should do the trick. The code itself is messy; it would be great to make it more compact and clear. No proper docs at the moment, but there are comments everywhere.
-
Hi @vkhodygo and @dkopasker - this is helpful for finding my way through the output file. I was wondering if it would be more helpful to have the Putting these in discussion for now - if it's a simple answer of "no, they'll stay this way" then note it here; otherwise I can open issues if further discussion is warranted.
-
Groups are not mutually exclusive. As you show, it is possible to be male and unemployed. For this reason we need multiple group identifiers within the individual level data. The aggregated output data could have a single group variable, if this provided some benefit.
-
A further quick question unrelated to the above (again, excuse my ignorance): would it be at all helpful to have the 100GB+ files read in and processed in batches, i.e. one run at a time (then appended to the output file)? This can be accomplished by using `fread` to filter before importing, or potentially by converting the output CSV of 1,000 runs to an SQLite database and reading in with The only potential advantage of this, of course, would be if it sped things up by not having the whole dataset in R at once.
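As a hedged illustration of the batch idea, here is a sketch using Python's stdlib `sqlite3` rather than R (the table and column names are hypothetical): each run is pulled and processed on its own, so the full dataset never sits in memory at once.

```python
import sqlite3

# a tiny in-memory stand-in for the converted 1,000-run output database
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE output (run INTEGER, out_ghq REAL)")
con.executemany("INSERT INTO output VALUES (?, ?)",
                [(1, 20.0), (1, 22.0), (2, 30.0)])

# process one run at a time, appending each run's summary to the results
means = {}
for (run,) in con.execute("SELECT DISTINCT run FROM output ORDER BY run"):
    rows = con.execute("SELECT out_ghq FROM output WHERE run = ?",
                       (run,)).fetchall()
    means[run] = sum(r[0] for r in rows) / len(rows)

print(means)  # {1: 21.0, 2: 30.0}
```

The same pattern applies with an on-disk database file; whether it is faster than loading everything at once depends on how much of each run's data the summaries actually need.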
-
Based on conversations with Chris Kypridemos, we should be reporting medians rather than means. Do we need to make any updates to the code? We already rank the outcomes, so the 50th percentile is there already.
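A small illustration of the point that no new machinery is needed: once the per-run outcomes are ordered, the median is the 50th percentile, and the 2.5th/97.5th percentiles give a 95% interval. This Python stdlib sketch stands in for the project's R code; the values are made up.

```python
from statistics import median, quantiles

# hypothetical per-run outcome values after collapsing to one obs per run
per_run_ghq = [19.0, 21.0, 20.0, 25.0, 18.0]

print(median(per_run_ghq))  # 20.0

# 39 cut points at 2.5% steps; the first and last bound a 95% interval
cuts = quantiles(per_run_ghq, n=40, method="inclusive")
ci = (cuts[0], cuts[-1])
```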
-
We have 6 outcomes, some of which must be constructed from the LABSim output:

- out_ghq (`dhm`)
- out_ghqcase (`dhm <= 24`)
- out_emp (`les_c4 == "EmployedOrSelfEmployed"`)
- out_emphrs (`laboursupplyweekly` converted to numerical)
- out_income (`equivaliseddisposableincomeyearl`)
- out_poverty (`atriskofpoverty == 1 || atriskofpoverty == null`)
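To make the dummy rules concrete, here is an illustrative Python stand-in (not the project's R/Stata code; the record layout is assumed) for constructing the three dummy outcomes from the raw columns, with `None` standing in for missing:

```python
def build_outcomes(rec):
    """Construct the three dummy outcomes from one raw record (a dict)."""
    dhm, les, arop = rec["dhm"], rec["les_c4"], rec["atriskofpoverty"]
    return {
        # missing dhm / les_c4 propagate to a missing outcome
        "out_ghqcase": None if dhm is None else int(dhm <= 24),
        "out_emp": None if les is None else int(les == "EmployedOrSelfEmployed"),
        # out_poverty, as specified, also counts missing atriskofpoverty as 1
        "out_poverty": 1 if (arop == 1 or arop is None) else 0,
    }

rec = {"dhm": 30.0, "les_c4": "NotEmployed", "atriskofpoverty": None}
print(build_outcomes(rec))  # {'out_ghqcase': 0, 'out_emp': 0, 'out_poverty': 1}
```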
We construct 13 groups, and also include results for the whole population:

- males (`dgn == "Male"`)
- females (`dgn == "Female"`)
- aged 25 to 44 (`dag >= 25 && dag < 45`)
- aged 45 to 64 (`dag >= 45 && dag < 65`)
- with children (`n_children_1-17 != 0`)
- without children (`n_children_1-17 == 0` or missing)
- employed (`les_c4 == "EmployedOrSelfEmployed"`)
- not employed (`les_c4 == "NotEmployed"`)
- employed and in poverty (`grp_emp == 1 && out_poverty == 1`)
- not employed and in poverty (`grp_emp == 0 && out_poverty == 1`)
- low education (`deh_c3 == "Low"`)
- medium education (`deh_c3 == "Medium"`)
- high education (`deh_c3 == "High"`)

We want:
Table structure: