Create individual file (#715)

* Until L594 * Converted until L677 * Until L731 * Update documentation * Remove test ref * Style code * WIP writing functions to fill postcode in line with previous DOB functions * Update documentation * implement quick fix for running 22/23 * Style code * Fix missed comma * Exclude DD code for now - TEMP fix * Correct/rename variables * Style code * Include NSU in `check_year_valid` * Update `check_year_valid_tests` * Update documentation * Update `add_nsu_cohort` to pick up years valid * Style code * remove extra `!` * Exclude `cij_delay` * Style code * improve `max_no_inf()` * Use pmin/max instead of `rowwise` * improve `min_no_inf()` * Use n_distinct(cij_marker) * deal with distinct(ch_chi_cis) * use n_distinct(ooh_case_id) * remove `find_non_duplicates` * Use dplyr::if_else() Co-authored-by: James McMahon <[email protected]> * Fix typo in `ooh_covid_assessment` * Move `ooh_case_id` to aggregate * Use `slfhelper::ltc_vars` * Remove `clean_up_dob` Already done in `correct_demographics` * Update documentation * [check-spelling] Update metadata Update for https://github.com/Public-Health-Scotland/source-linkage-files/actions/runs/4981058958/attempts/1 Accepted in #654 (comment) Signed-off-by: check-spelling-bot <[email protected]> * Use `start_next_fy_quarter` in place of rowwise * Style code * Use `compute_mid_year_age` * convert code into data.table for improving speed * Update `get_fy_dates`function * remove `date_from_fy`, use `get_fy_dates` * Update documentation * Remove `clean_up_postcode` function Not needed anymore * Remove non duplicates function/move to aggregate * Style code * Update documentation * Add time stamps to `create_individual_file` * Style code * remove `clean_up_postcode` * Deal with ch cis episodes * Style code * add .data$ * Turn ch aggregate into a data table * Style code * use ch_chi_cis * remove `preventable_admissions` from aggregate * exclude `hh_in_fy` for now * Style code * Test - exclude `sc_` vars from aggregate * Style code * 
Exclude for now * exclude for now * Style code * automate `check_year_valid` * Return dummy file path for NSU not valid * Style code * Fix brackets in aggregate * TEMP - exclude variables * Use `phsmethods::sex_from_chi` * Style code * Add ungroup() * lowercase dob * Remove as.data.table * rewrite aggregate_by_chi with data.table * Style code * minor changes * Use the updated function * to properly import data.table * remove redundant columns dob postcode and gpprac * minor changes to remove redundant postcode gpprac columns * Style code * rename columns with small letters * Style code * newaggregate_ch_episodes * Update documentation * add functions to replace regular expressions to select column/variables * Update documentation * Style code * minor changes * add a missing variable, cij_delay * Style code * add variables cij_delay, preventable_beddays * add missing variables health_net_cost, health_net_costincdnas, and cmh, dd sds columns * Style code * add more variables needed * Style code * Update R/link_delayed_discharge_eps.R * Style code * amend costs * Style code * Revert "amend costs" This reverts commit 8048e68. 
* Add DN and cij_delay back in * fix the issue * Style code * remove running in chunks * Style code * Update tests to include missing variables * Remove unnecessary comma * fix the bug of preventable_beddays * Update documentation * fix total ae_attendances * fix the bug of preventable_admissions * fix the bug of hbrescode etc * minor fix * minor fix * Style code * Fix some warnings being produced by the tests * Fix failing test * remove running in chunks * Style code * Update the targets config to use `timestamp_positives` as the default reporter * fix the bug of preventable_beddays * Update documentation * fix total ae_attendances * fix the bug of preventable_admissions * fix the bug of hbrescode etc * minor fix * minor fix * Style code * fix home care cost * add ipdc to fix maternity * fix preventable addmission and care home cost * fix preventable_admissions and calculate preventable_beddays here * add monthly_beddays and yearstay to dd * Style code * fix preventable_admissions and preventable_beddays * Style code * include parameter for write to disk/year * Add lookups to indiv file creation pipeline * include parameter for write to disk/year * fix delay discharge beddays and yearstay * Style code * fix preventable issues * Style code * fix the issue of preventable stuff * Style code * Update R/aggregate_by_chi_zihao.R * Update documentation * Fix minor typos * [check-spelling] Update metadata Update for https://github.com/Public-Health-Scotland/source-linkage-files/actions/runs/5443581387/attempts/1 Accepted in #709 (comment) Signed-off-by: check-spelling-bot <[email protected]> * Remove some obsolete comments * Remove some unnecessary brackets * Reformat some code * Use some `dplyr` functions for readability * Style code * Update R/link_delayed_discharge_eps.R * Style code * Remove some code which is no longer needed We now match on these variables after * Work out preventable admissions with similar indicators * Lowercase variable names * Restore 
`cij_delay` * Restore DN variables * Tidy the code and use integers where possible * Supply `year` as a parameter to `clean_up_ch` * Supply `year` as a parameter to `clean_individual_file` * Only keep required variables to save memory * Rename the parameter so the documentation works * Use `setnames` to change names to lower * Remove unneeded code * Update file path name * Trim the return code * Some fixes * Correctly compute `ooh_cases` * Update documentation * Style code * [check-spelling] Update metadata Update for https://github.com/Public-Health-Scotland/source-linkage-files/actions/runs/5466392495/attempts/1 Accepted in #719 (comment) Signed-off-by: check-spelling-bot <[email protected]> * Add targets for the individual file * Fix missed pipe * Style code * Update some targets to only run once a week * Make the deaths lookup unique * Add `year` back to the individual file * Remove `cost_total_net_inc_dnas` from the indiv file (#737) * Drop `cost_total_net_inc_dnas` * Rename `health_net_costincdnas` to `health_net_cost_inc_dnas` * Join slf lookups onto individual file (#724) * Create function for matching on slf lookups * fix some build warnings * Add `hbrescode` to select list * Pass lookups as parameters/deal with hbrescode * Update R/create_individual_file.R --------- Co-authored-by: James McMahon <[email protected]> * Join sc client variables onto individual file (#740) * New function for matching sc client to indiv file * Style code * [check-spelling] Update metadata Update for https://github.com/Public-Health-Scotland/source-linkage-files/actions/runs/5555048903/attempts/1 Accepted in #740 (comment) Signed-off-by: check-spelling-bot <[email protected]> * Code layout * Style code * Remove redundant sc variables Co-authored-by: James McMahon <[email protected]> * Update comments Co-authored-by: James McMahon <[email protected]> * Update comments Co-authored-by: James McMahon <[email protected]> * Sort order of parameters to pass `data` first * Update 
documentation * Style code * Update R/create_individual_file.R * Update R/create_individual_file.R * Update R/create_individual_file.R * Style code --------- Signed-off-by: check-spelling-bot <[email protected]> Co-authored-by: Jennit07 <[email protected]> Co-authored-by: James McMahon <[email protected]> Co-authored-by: Moohan <[email protected]> * Update documentation * Output the individual file with `anon_chi` (#748) * Make episode file output with `anon_chi` I've added this as a parameter so you can output CHI if desired, but the default is for anon_chi. For the tests, it swaps back to CHI as there are some tests which specifically us the CHI number. * Output `anon_chi` in the individual file * Style code * Sort variables with issues `hbrescode` (HB2018), `datazone` and `hscp` (#746) * rename `hscp` to `hscp2018` * rename `spd` as `slf_pc_lookup` * Add `datazone2011` to coalesce code * Rename `datazone` to `datazone2011` * include `datazone2011_old` in selections * Update R/fill_geographies.R --------- Co-authored-by: James McMahon <[email protected]> * Fix for anon_chi being NA --------- Co-authored-by: Moohan <[email protected]> Co-authored-by: Jennit07 <[email protected]> --------- Signed-off-by: check-spelling-bot <[email protected]> Co-authored-by: Mandy Norrbo <[email protected]> Co-authored-by: jr-mandy <[email protected]> Co-authored-by: shintoLampgit config --global user.email [email protected] git config --global user.name shintoLamp <[email protected]> Co-authored-by: shintoLamp <[email protected]> Co-authored-by: Jennit07 <[email protected]> Co-authored-by: Jennifer Thom <[email protected]> Co-authored-by: Jennit07 <[email protected]> Co-authored-by: Zihao Li <[email protected]> Co-authored-by: lizihao-anu <[email protected]> Co-authored-by: Moohan <[email protected]> Co-authored-by: Zihao Li <[email protected]>
Public-Health-Scotland · Jul 19, 2023 · 8db3769
1 parent 74109bf
Showing 45 changed files with 1,818 additions and 16 deletions.
diff --git a/.github/actions/spelling/expect.txt b/.github/actions/spelling/expect.txt
@@ -28,6 +28,7 @@ cmh
 CNWs
 commhosp
 congen
+costincdnas
 costmonthnum
 costsfy
 covr
@@ -45,6 +46,7 @@ dbconnect
 dbplyr
 deathdiag
 demog
+dfc
 disch
 dischloc
 dischto
@@ -70,6 +72,7 @@ fyyear
 geogs
 ggplot
 GLS
+gls
 gms
 GPOo
 gpprac
@@ -86,6 +89,7 @@ hhg
 hjust
 hms
 homecare
+homev
 hscp
 hscpnames
 IDPC
@@ -102,6 +106,8 @@ keyring
 keytime
 keytimex
 kis
+lgl
+kis
 los
 ltc
 ltcs
@@ -116,6 +122,7 @@ multiday
 multisession
 multistaff
 NAs
+newcons
 nhs
 nhshosp
 NRS
@@ -147,7 +154,9 @@ purrr
 quickstart
 Rbuildignore
 rcmdcheck
+rdd
 rds
+reabl
 reablement
 readcode
 readr
@@ -164,8 +173,12 @@ rspm
 RStudio
 rstudioapi
 Rtype
+SDcols
 seealso
 selfharm
+setkeyv
+setnafill
+setnames
 Siar
 sigfac
 simd

diff --git a/DESCRIPTION b/DESCRIPTION
@@ -55,7 +55,8 @@ Imports:
     stringr (>= 1.5.0),
     tibble (>= 3.2.1),
     tidyr (>= 1.3.0),
-    tidyselect (>= 1.2.0)
+    tidyselect (>= 1.2.0),
+    zoo (>= 1.8.0)
 Suggests:
     covr (>= 3.6.1),
     roxygen2 (>= 7.2.3),

diff --git a/NAMESPACE b/NAMESPACE
@@ -13,6 +13,7 @@ export(convert_hscp_to_hscpnames)
 export(convert_numeric_to_date)
 export(convert_sending_location_to_lca)
 export(convert_year_to_fyyear)
+export(create_individual_file)
 export(create_service_use_cohorts)
 export(end_fy)
 export(end_fy_quarter)
@@ -160,6 +161,8 @@ export(start_fy)
 export(start_fy_quarter)
 export(start_next_fy_quarter)
 export(write_file)
+importFrom(data.table,.N)
+importFrom(data.table,.SD)
 importFrom(magrittr,"%>%")
 importFrom(readr,col_character)
 importFrom(readr,col_date)

diff --git a/R/aggregate_by_chi_zihao.R b/R/aggregate_by_chi_zihao.R
@@ -0,0 +1,215 @@
+#' Aggregate by CHI
+#'
+#' @description Aggregate episode file by CHI to convert into
+#' individual file.
+#'
+#' @importFrom data.table .N
+#' @importFrom data.table .SD
+#'
+#' @inheritParams create_individual_file
+aggregate_by_chi_zihao <- function(episode_file) {
+  cli::cli_alert_info("Aggregate by CHI function started at {Sys.time()}")
+
+  # Convert to data.table
+  data.table::setDT(episode_file)
+
+  # Ensure all variable names are lowercase
+  data.table::setnames(episode_file, stringr::str_to_lower)
+
+  # Sort the data
+  data.table::setkeyv(
+    episode_file,
+    c(
+      "chi",
+      "record_keydate1",
+      "keytime1",
+      "record_keydate2",
+      "keytime2"
+    )
+  )
+
+  data.table::setnames(
+    episode_file,
+    c(
+      "ch_chi_cis", "cij_marker", "ooh_case_id"
+      # ,"hh_in_fy"
+    ),
+    c(
+      "ch_cis_episodes", "cij_total", "ooh_cases"
+      # ,"hl1_in_fy"
+    )
+  )
+
+  # column specification, grouped by chi
+  # columns to select last
+  cols2 <- c(
+    "postcode",
+    "dob",
+    "gpprac",
+    vars_start_with(episode_file, "sc_")
+  )
+  # columns to count unique rows
+  cols3 <- c(
+    "ch_cis_episodes",
+    "cij_total",
+    "cij_el",
+    "cij_non_el",
+    "cij_mat",
+    "cij_delay",
+    "ooh_cases",
+    "preventable_admissions"
+  )
+  # columns to sum up
+  cols4 <- c(
+    vars_end_with(
+      episode_file,
+      c(
+        "episodes",
+        "beddays",
+        "cost",
+        "attendances",
+        "attend",
+        "contacts",
+        "hours",
+        "alarms",
+        "telecare",
+        "paid_items",
+        "advice",
+        "homev",
+        "time",
+        "assessment",
+        "other",
+        "dn",
+        "nhs24",
+        "pcc",
+        "_dnas"
+      )
+    ),
+    vars_start_with(
+      episode_file,
+      "sds_option"
+    ),
+    "health_net_cost_inc_dnas"
+  )
+  cols4 <- cols4[!(cols4 %in% c("ch_cis_episodes"))]
+  # columns to select maximum
+  cols5 <- c("nsu", vars_contain(episode_file, c("hl1_in_fy")))
+  data.table::setnafill(episode_file, fill = 0L, cols = cols5)
+  # compute
+  individual_file_cols1 <- episode_file[,
+    .(gender = mean(gender)),
+    by = "chi"
+  ]
+  individual_file_cols2 <- episode_file[,
+    .SD[.N],
+    .SDcols = cols2,
+    by = "chi"
+  ]
+  individual_file_cols3 <- episode_file[,
+    lapply(.SD, function(x) {
+      data.table::uniqueN(x, na.rm = TRUE)
+    }),
+    .SDcols = cols3,
+    by = "chi"
+  ]
+  individual_file_cols4 <- episode_file[,
+    lapply(.SD, function(x) {
+      sum(x, na.rm = TRUE)
+    }),
+    .SDcols = cols4,
+    by = "chi"
+  ]
+  individual_file_cols5 <- episode_file[,
+    lapply(.SD, function(x) max(x, na.rm = TRUE)),
+    .SDcols = cols5,
+    by = "chi"
+  ]
+  individual_file_cols6 <- episode_file[,
+    .(
+      preventable_beddays = ifelse(
+        max(cij_ppa, na.rm = TRUE),
+        max(cij_end_date) - min(cij_start_date),
+        NA_real_
+      )
+    ),
+    # cij_marker has been renamed as cij_total
+    by = c("chi", "cij_total")
+  ]
+  individual_file_cols6 <- individual_file_cols6[,
+    .(
+      preventable_beddays = sum(preventable_beddays, na.rm = TRUE)
+    ),
+    by = "chi"
+  ]
+
+  individual_file <- dplyr::bind_cols(
+    individual_file_cols1,
+    individual_file_cols2[, chi := NULL],
+    individual_file_cols3[, chi := NULL],
+    individual_file_cols4[, chi := NULL],
+    individual_file_cols5[, chi := NULL],
+    individual_file_cols6[, chi := NULL]
+  )
+
+  # convert back to tibble
+  return(dplyr::as_tibble(individual_file))
+}
+
+
+#' select columns ending with some patterns
+#' @describeIn select columns based on patterns
+vars_end_with <- function(data, vars, ignore_case = FALSE) {
+  names(data)[stringr::str_ends(
+    names(data),
+    stringr::regex(paste(vars, collapse = "|"),
+      ignore_case = ignore_case
+    )
+  )]
+}
+
+#' select columns starting with some patterns
+#' @describeIn select columns based on patterns
+vars_start_with <- function(data, vars, ignore_case = FALSE) {
+  names(data)[stringr::str_starts(
+    names(data),
+    stringr::regex(paste(vars, collapse = "|"),
+      ignore_case = ignore_case
+    )
+  )]
+}
+
+#' select columns containing some patterns
+#' @describeIn select columns based on patterns
+vars_contain <- function(data, vars, ignore_case = FALSE) {
+  names(data)[stringr::str_detect(
+    names(data),
+    stringr::regex(paste(vars, collapse = "|"),
+      ignore_case = ignore_case
+    )
+  )]
+}
+
+#' Aggregate CIS episodes
+#'
+#' @description Aggregate CH variables by CHI and CIS.
+#'
+#' @inheritParams create_individual_file
+aggregate_ch_episodes_zihao <- function(episode_file) {
+  cli::cli_alert_info("Aggregate ch episodes function started at {Sys.time()}")
+
+  # Convert to data.table
+  data.table::setDT(episode_file)
+
+  # Perform grouping and aggregation
+  episode_file <- episode_file[, `:=`(
+    ch_no_cost = max(ch_no_cost),
+    ch_ep_start = min(record_keydate1),
+    ch_ep_end = max(ch_ep_end),
+    ch_cost_per_day = mean(ch_cost_per_day)
+  ), by = c("chi", "ch_chi_cis")]
+
+  # Convert back to tibble if needed
+  episode_file <- tibble::as_tibble(episode_file)
+
+  return(episode_file)
+}
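
Note on the aggregation pattern in `aggregate_by_chi_zihao()`: the function builds one grouped result per column family (last value, distinct count, sum) and then combines them. The following is a minimal sketch of that pattern on toy data, not the package's implementation. The data values and the `acute_cost` column are made up for illustration, base `grep()` stands in for the `vars_end_with()` helper, and the results are merged on `chi` instead of being combined with `dplyr::bind_cols()` as the diff does (binding by position relies on every grouped result coming back in the same CHI order).

```r
library(data.table)

# Toy episode-level data. Column names mirror the diff; values are invented.
episodes <- data.table(
  chi = c("A", "A", "A", "B", "B"),
  record_keydate1 = as.Date(c(
    "2022-04-01", "2022-05-10", "2022-06-20", "2022-04-15", "2022-07-01"
  )),
  postcode = c("EH1 1AA", "EH1 1AA", "EH2 2BB", "G1 1CC", "G1 1CC"),
  cij_marker = c(1L, 1L, 2L, 1L, NA),
  acute_cost = c(100, 50, 25, 10, NA)
)

# Sort so that "last row per CHI" means "most recent episode".
setkeyv(episodes, c("chi", "record_keydate1"))

# Pick columns by suffix. Base grep() is used here as a stand-in for the
# vars_end_with() helper defined in the diff.
sum_cols <- grep("(cost|beddays)$", names(episodes), value = TRUE)

# Last value per CHI: .SD[.N] selects the final row of each group.
last_vals <- episodes[, .SD[.N], .SDcols = "postcode", by = "chi"]

# Distinct counts per CHI, ignoring NA.
counts <- episodes[,
  lapply(.SD, function(x) uniqueN(x, na.rm = TRUE)),
  .SDcols = "cij_marker",
  by = "chi"
]

# Sums per CHI, ignoring NA.
sums <- episodes[,
  lapply(.SD, function(x) sum(x, na.rm = TRUE)),
  .SDcols = sum_cols,
  by = "chi"
]

# Combine on the grouping key (an alternative to binding columns by position).
individual <- Reduce(
  function(x, y) merge(x, y, by = "chi"),
  list(last_vals, counts, sums)
)
print(individual)
```

Merging on `chi` costs a little more than binding columns, but it does not depend on row order, which can matter if one of the intermediate results is later filtered or re-sorted.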