Create HRI Variables #778
Looks great. I haven't tested it in R, and I couldn't see an obvious place where you might be getting duplicate rows.
One suggestion would be to use the new error checking on the joins (https://www.tidyverse.org/blog/2023/01/dplyr-1-1-0-joins/#unmatched-rows)
For a lookup you can be really explicit and do something like:
left_join(
  data,
  lookup,
  by = "lookup_var",
  na_matches = "never",
  unmatched = "drop",
  relationship = "one-to-one"
)
I think `relationship` and `na_matches` are the only non-default arguments there (`unmatched = "drop"` is already the default). https://dplyr.tidyverse.org/reference/mutate-joins.html#arguments
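To make the suggestion concrete, here is a minimal sketch of how the `relationship` check behaves when a lookup is not unique. The tibbles and the `chi`, `cost`, and `hri` columns are made-up stand-ins, not the real SLF data, and it assumes dplyr 1.1.0 or later.

```r
library(dplyr)

# Hypothetical toy data: `chi` stands in for the real join key.
individual <- tibble(chi = c("A", "B", "C"), cost = c(10, 20, 30))

# The lookup has a duplicated key ("B"), as a non-unique extract would.
hri_lookup <- tibble(chi = c("A", "B", "B"), hri = c(1, 2, 2))

# Without a relationship check, the join returns more rows than `individual`
# because "B" matches twice in the lookup.
unchecked <- left_join(individual, hri_lookup, by = "chi")
rows_gained <- nrow(unchecked) - nrow(individual)

# With `relationship = "one-to-one"`, the same join raises an error instead,
# which is captured here as a message string for inspection.
checked <- tryCatch(
  left_join(
    individual,
    hri_lookup,
    by = "chi",
    na_matches = "never",
    relationship = "one-to-one"
  ),
  error = function(e) conditionMessage(e)
)
```

If the data side is expected to repeat keys while the lookup stays unique, `relationship = "many-to-one"` may be the closer fit; which one applies here depends on the shape of the real files.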
The duplicates in the individual files may contribute to the issue. I am investigating it. #786
@lizihao-anu identified that the individual file has a number of duplicated rows; this looks to be (at least in part) because the SC client extract wasn't unique. Zihao is looking at this issue separately. I've checked the HRI code on a 'deduplicated' individual file (I just did
@lizihao-anu Please check this PR. The test would be to check the numbers against the archived 1718 fst file in hscdiip.
I have checked: once the duplicated rows in the 1718 individual file are eliminated, there is no difference in row numbers between the HRI output and the individual file, so the code is fine from this perspective. The new 1718 individual file produced by R differs in row count by around 0.11% from the old file produced by SPSS. So it should be cool!
f417fbe to 5e0fac2
Thanks Zihao! Great work, we can discuss this tomorrow!
Looks good, I think this is ready to merge!
I have written some code to add the HRI variables to the individual file, but I need someone to check the functionality.
The main issue is that the individual file I was using for testing (2019/20) has 5,983,038 rows, and once the HRIs are matched on there are 5,986,774, so we've gained 3,736 rows. I imagine that the `left_join` at the end of `add_hri_variables` is doing something unexpected.
Could someone run through the code, please, and try to figure out what's going on?
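One way to narrow this down is to count repeated join keys on both sides of the join and compare row counts before and after. The sketch below uses made-up tibbles and a hypothetical helper, `find_duplicate_keys()`, which is not part of the package; the column names `chi`, `total_cost`, and `hri_scot` are also assumptions.

```r
library(dplyr)

# Hypothetical helper (not part of the package): list key values that
# appear on more than one row.
find_duplicate_keys <- function(data, key) {
  data %>%
    count(across(all_of(key)), name = "n_rows") %>%
    filter(n_rows > 1)
}

# Made-up stand-ins for the individual file and the HRI table.
individual_file <- tibble(
  chi = c("A", "B", "B", "C"),
  total_cost = c(100, 250, 250, 75)
)
hri_table <- tibble(
  chi = c("A", "B", "B", "C"),
  hri_scot = c(1, 2, 2, 3)
)

# Any rows returned here are keys that can multiply rows in a join.
dup_individual <- find_duplicate_keys(individual_file, "chi")
dup_hri <- find_duplicate_keys(hri_table, "chi")

# dplyr warns about the many-to-many match; the warning is suppressed here
# only so that the row counts can be compared quietly.
joined <- suppressWarnings(
  left_join(individual_file, hri_table, by = "chi")
)
rows_gained <- nrow(joined) - nrow(individual_file)
```

A non-zero `rows_gained` together with non-empty duplicate tables would point at repeated keys rather than at the join logic itself.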