Replies: 22 comments 7 replies
-
Could you please clarify how to calculate prevalence in this case? Also, I'd like to know how to format the output file; the example you provided is a bit confusing.
-
Prevalence is the proportion of the sample with a specific characteristic. If the characteristic is defined by a dummy variable, as is the case here, then the mean of the dummy variable gives the prevalence. Regarding the format, I provided a CSV example that I made in Stata. This was provided with the expectation that you would try to reproduce it using more efficient code. A separate file of column descriptions was provided to assist in this. Please let me know if there is anything specific that remains confusing. Equally, if you think a different format would be useful, I am open to suggestions.
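For concreteness, here is a minimal sketch of that rule (prevalence as the mean of a 0/1 dummy), written in Python rather than the project's Stata/R; the `prevalence` helper and the example values are illustrative only:

```python
def prevalence(dummies):
    """Mean of a 0/1 dummy variable, skipping missing (None) observations."""
    observed = [d for d in dummies if d is not None]
    return sum(observed) / len(observed)

# e.g. 3 of 4 observed individuals have the characteristic; one is missing
emp_dummy = [1, 0, 1, None, 1]
print(prevalence(emp_dummy))  # 0.75
```

Note that missing observations are excluded from both the numerator and the denominator, which matches the mean of the dummy over non-missing cases.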
-
You say that
does it mean automatically
I.e., you transform a continuous variable into a boolean one (sort of). I'm familiar with this kind of transformation; I just want to be sure I understand everything correctly.
-
Almost. The dummy could be missing if `dhm` is missing. Stata treats missing as infinity, so `dummy=0 if dhm>24 & dhm!=.` would be needed. This may not be the case with other software.
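A sketch of the same point outside Stata, using Python with `None` standing in for a missing value (illustrative only, not the project's code): the dummy is set only when `dhm` is observed, so a missing input gives a missing outcome rather than a spurious 0.

```python
def ghq_case(dhm):
    """1 if dhm <= 24, 0 if dhm > 24, missing (None) if dhm is missing.

    In Stata, missing sorts as +infinity, so a bare "dhm > 24" would be
    true for missing values; guarding on missingness first avoids that.
    """
    if dhm is None:
        return None          # missing input -> missing outcome
    return 1 if dhm <= 24 else 0

print([ghq_case(v) for v in [10, 30, None]])  # [1, 0, None]
```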
-
Fair enough, I keep forgetting about missing values. So, the approach is to interpret all missing values as 0/null/false etc.? One more thing: you mentioned employed as a separate group, but for them the prevalence of employment makes no sense. What am I missing here?
-
The approach is to treat any variable determined by a missing variable as also being missing. Zero and false are not usually missing values. Null usually is. You are correct that prevalence of employment will not vary in the employed group. Likewise, there are poverty groups where the prevalence of poverty will not vary. However, it was more efficient to run a loop covering all outcomes for all groups. This also provides a check that things are working as they should.
-
Fair enough. I went through the data columns now to load only what is needed. It looks like
-
Where is `out_poverty` missing?
-
Both:

```r
> colnames(df)
 [1] "V1" "run" "time" "id_household"
 [5] "id_benefitunit" "id_female" "id_male" "id_person"
 [9] "id_father" "id_mother" "id_original" "id_partner"
[13] "hh_dwt" "hh_size" "atriskofpoverty" "dhh_owned"
[17] "dhhtp_c4" "disposableincomemonthly" "equivaliseddisposableincomeyearl" "n_children_0"
[21] "n_children_1" "n_children_10" "n_children_11" "n_children_12"
[25] "n_children_13" "n_children_14" "n_children_15" "n_children_16"
[29] "n_children_17" "n_children_2" "n_children_3" "n_children_4"
[33] "n_children_5" "n_children_6" "n_children_7" "n_children_8"
[37] "n_children_9" "occupancy" "region" "bu_size"
[41] "ydses_c5" "adultchildflag" "dag" "dcpagdf"
[45] "dcpen" "dcpex" "dcpst" "dcpyy"
[49] "ded" "deh_c3" "dehf_c3" "dehm_c3"
[53] "dehsp_c3" "der" "dgn" "dhe"
[57] "dhesp" "dhm" "dlltsd" "inversemillsratiomaxfemale"
[61] "inversemillsratiomaxmale" "inversemillsratiominfemale" "inversemillsratiominmale" "laboursupplyweekly"
[65] "les_c4" "les_c7_covid" "lesdf_c4" "lessp_c4"
[69] "potentialearnings" "sindex" "sindexnormalised" "weight"
[73] "yearlyequivalisedconsumption" "ynbcpdf_dv" "yplgrs_dv" "ypnbihs_dv"
[77] "ypncp" "ypnoab" "yptciihs_dv" "scaling_factor"
```
-
I see. These outcomes must be constructed based on the rules defined at the top of this page. I have used `dummy=` for three outcomes to show they are dummy variables. Dummy is not the variable name; these would be `out_ghqcase`, `out_emp`, and `out_poverty`. There is an additional suffix on the variable name to indicate whether the outcome relates to the baseline or reform arm of the scenario. The other outcomes, those that are not dummy variables, have the variable names `out_ghq`, `out_emphrs`, and `out_income`, again with the suffix attached. Rules to construct groups were also provided, but not the variable names for the output. All group variables have the prefix `grp` and should be self-explanatory.
-
I get it, I'll fix the names tomorrow. Meanwhile, you can take a look at the first version here: 28c8ef3. The path to the data file within the project directory should be It's a bit messy, no tests, no different scenarios at the moment, but it can produce most of the results and it's performant enough (takes about two minutes to get it done).
-
That seems a good start. More comments in the code would be helpful, as I am less familiar with R.
-
Noted; that will help everyone, since my R skills are rusty as well.
-
@dkopasker I finally have a version that calculates the metrics for the whole population, modifying it for other cases should not be that difficult. However, I never asked you about the definition of rank.
-
Rank numbers the observations for each outcome from highest to lowest. Equal observations are assigned the average rank. For example, `rank_ghq_base` orders observations on `out_ghq_base` from 1 to N, where N is the number of runs. Ranking must be done after collapsing to one observation per run per year. The rank will be used to define percentiles relevant to a 95% confidence interval.
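To pin down the tie-handling rule, here is an illustrative Python sketch (not the project's R code): observations are ordered from highest to lowest, and tied values receive the average of the ranks they would otherwise occupy.

```python
def average_ranks(values):
    """Rank values from highest (rank 1) to lowest; ties get the mean rank."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        # find the run of tied values starting at pos
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = ((pos + 1) + (end + 1)) / 2  # average of ranks pos+1 .. end+1
        for j in order[pos:end + 1]:
            ranks[j] = mean_rank
        pos = end + 1
    return ranks

# the two 3.0s tie for ranks 1 and 2, so each gets 1.5
print(average_ranks([3.0, 1.0, 3.0, 2.0]))  # [1.5, 4.0, 1.5, 3.0]
```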
-
Good to know, I'll add this to the code. See 62f28ba for the latest version, no ranks and no effects at the moment, but all groups are now included. This takes about 90 seconds to get done on my machine with the
-
A 115GB file of output from 1,000 runs of 1 arm of the COVID scenario is now stored on the T drive. Compressing the file was predicted to take 11 hours, but copying took only 20 minutes. We can see how long this file takes to run on our laptops, but getting R installed on the "machine up the stairs" is likely to be beneficial.
-
The latest commit now provides rank support. You can run this script on your machine; 16GB of RAM is enough for 50 runs and 1 scenario. I think even 8GB should do the trick. The code itself is messy; it would be great to make it more compact and clear. No proper docs at the moment, but there are comments everywhere.
-
Hi @vkhodygo and @dkopasker - this is helpful for finding my way through the output file. I was wondering if it would be more helpful to have the Putting these in discussion for now - if it's a simple answer of "no, they'll stay this way" then note it here; otherwise I can open issues if further discussion is warranted.
-
Groups are not mutually exclusive. As you show, it is possible to be male and unemployed. For this reason we need multiple group identifiers within the individual level data. The aggregated output data could have a single group variable, if this provided some benefit.
-
A further quick question unrelated to the above (again, excuse my ignorance): would it be at all helpful to have the 100GB+ files read in and processed in batches, i.e. one run at a time (then appended to the output file)? This can be accomplished by using `fread` to filter before importing, or potentially by converting the output CSV of 1,000 runs to an SQLite database and reading in with The only potential advantage of this, of course, would be if it sped things up by not having the whole dataset in R at once.
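As a hedged illustration of the batch idea, here is a sketch using Python's stdlib `sqlite3` rather than R (the table and column names are hypothetical): each run is pulled and processed on its own, so the full dataset never sits in memory at once.

```python
import sqlite3

# a tiny in-memory stand-in for the converted 1,000-run output database
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE output (run INTEGER, out_ghq REAL)")
con.executemany("INSERT INTO output VALUES (?, ?)",
                [(1, 20.0), (1, 22.0), (2, 30.0)])

# process one run at a time, appending each run's summary to the results
means = {}
for (run,) in con.execute("SELECT DISTINCT run FROM output ORDER BY run"):
    rows = con.execute("SELECT out_ghq FROM output WHERE run = ?",
                       (run,)).fetchall()
    means[run] = sum(r[0] for r in rows) / len(rows)

print(means)  # {1: 21.0, 2: 30.0}
```

The same pattern applies with an on-disk database file; whether it is faster than loading everything at once depends on how much of each run's data the summaries actually need.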
-
Based on conversations with Chris Kypridemos, we should be reporting medians rather than means. Do we need to make any updates to the code? We already rank the outcomes, so the 50th percentile is there already.
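A small illustration of the point that no new machinery is needed: once the per-run outcomes are ordered, the median is the 50th percentile, and the 2.5th/97.5th percentiles give a 95% interval. This Python stdlib sketch stands in for the project's R code; the values are made up.

```python
from statistics import median, quantiles

# hypothetical per-run outcome values after collapsing to one obs per run
per_run_ghq = [19.0, 21.0, 20.0, 25.0, 18.0]

print(median(per_run_ghq))  # 20.0

# 39 cut points at 2.5% steps; the first and last bound a 95% interval
cuts = quantiles(per_run_ghq, n=40, method="inclusive")
ci = (cuts[0], cuts[-1])
```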
-
We have 6 outcomes, some of which must be constructed from the LABSim output:

- out_ghq (`dhm`)
- out_ghqcase (`dhm <= 24`)
- out_emp (`les_c4 == "EmployedOrSelfEmployed"`)
- out_emphrs (`laboursupplyweekly` converted to numerical)
- out_income (`equivaliseddisposableincomeyearl`)
- out_poverty (`atriskofpoverty == 1 || atriskofpoverty == null`)
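To make the dummy rules concrete, here is an illustrative Python stand-in (not the project's R/Stata code; the record layout is assumed) for constructing the three dummy outcomes from the raw columns, with `None` standing in for missing:

```python
def build_outcomes(rec):
    """Construct the three dummy outcomes from one raw record (a dict)."""
    dhm, les, arop = rec["dhm"], rec["les_c4"], rec["atriskofpoverty"]
    return {
        # missing dhm / les_c4 propagate to a missing outcome
        "out_ghqcase": None if dhm is None else int(dhm <= 24),
        "out_emp": None if les is None else int(les == "EmployedOrSelfEmployed"),
        # out_poverty, as specified, also counts missing atriskofpoverty as 1
        "out_poverty": 1 if (arop == 1 or arop is None) else 0,
    }

rec = {"dhm": 30.0, "les_c4": "NotEmployed", "atriskofpoverty": None}
print(build_outcomes(rec))  # {'out_ghqcase': 0, 'out_emp': 0, 'out_poverty': 1}
```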
We construct 13 groups, and also include results for the whole population:

- males (`dgn == "Male"`)
- females (`dgn == "Female"`)
- aged 25 to 44 (`dag >= 25 && dag < 45`)
- aged 45 to 64 (`dag >= 45 && dag < 65`)
- with children (`n_children_1-17 != 0`)
- without children (`n_children_1-17 == 0` or missing)
- employed (`les_c4 == "EmployedOrSelfEmployed"`)
- not employed (`les_c4 == "NotEmployed"`)
- employed and in poverty (`grp_emp == 1 && out_poverty == 1`)
- not employed and in poverty (`grp_emp == 0 && out_poverty == 1`)
- low education (`deh_c3 == "Low"`)
- medium education (`deh_c3 == "Medium"`)
- high education (`deh_c3 == "High"`)

We want:
Table structure: