---
output: github_document
editor_options:
  chunk_output_type: console
bibliography: ../Research-Provider_Networks/manuscripts/networks-r01-bibliography.bib
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(warning = FALSE)
knitr::opts_chunk$set(message = FALSE)
suppressWarnings(suppressMessages(source(here::here("R/manifest.R"))))
s3 <- paws::s3()
s3$list_objects(Bucket = "govgptco")
source(here("R/map-theme.R"))
source(here("R/shared-objects.R"))
source(here("R/theme-tufte-revised.R"))
```
# Defining Markets for Health Care Services
The objective of this repository is to lay out some thoughts, analytics, and data for defining geographic markets for health care services. In other words, it is a guided tour of a particularly complex rabbit hole.
Geographic market definitions are important for a variety of regulatory and research applications. Therefore, for any given use (e.g., analyses of a health system or hospital merger) or measure (e.g., constructing a Herfindahl-Hirschman index of market concentration) it is important to know whether and how the analytic output varies by alternative market definitions.
For example, suppose our goal is to characterize insurers, hospitals or other providers by whether they operate in a concentrated market. If we use a market geography definition that is too narrow (e.g., county) we risk mischaracterizing markets as "concentrated" when they are really not (i.e., Type I error). Alternatively, a market definition that is too broad (e.g., state) risks characterizing markets as competitive when in practice a hypothetical merger or market exit could materially affect prices and competitiveness (i.e., Type II error).
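To make this trade-off concrete, here is a minimal sketch using made-up admission counts (the hospitals, counties, and numbers are all hypothetical and not drawn from the data used later). It computes the Herfindahl-Hirschman index, i.e., the sum of squared percentage market shares, under a narrow county-level definition and under a broader definition that pools the two counties.

```r
# Hypothetical admissions for five hospitals in two neighboring counties.
admissions <- data.frame(
  hospital = c("A", "B", "C", "D", "E"),
  county   = c("X", "X", "Y", "Y", "Y"),
  cases    = c(800, 200, 400, 300, 300)
)

# HHI = sum of squared market shares, with shares expressed in percent (0-100).
hhi <- function(cases) {
  shares <- 100 * cases / sum(cases)
  sum(shares^2)
}

# Narrow definition: each county is treated as its own market.
hhi_county <- tapply(admissions$cases, admissions$county, hhi)

# Broad definition: both counties are pooled into a single market.
hhi_pooled <- hhi(admissions$cases)

hhi_county
hhi_pooled
```

With these assumed numbers, county X has an HHI of 6,800 when treated as its own market, while the pooled market has an HHI of 2,550. The same hospitals can therefore look highly concentrated or only moderately so, depending solely on where the boundary is drawn.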
Not surprisingly given the above issues, commonly used market geographies have trade-offs. Whether the strengths outweigh the weaknesses for a given application will depend on the specific research or regulatory question at hand.
There are other important considerations at play as well. For example, some market definitions are constrained by geopolitical boundaries (e.g., state borders). While this may be fine for some settings (e.g., rate regulation in insurance markets, since consumers can only purchase a plan offered in their market) it may not be for others (e.g., hospital markets, in which patients are unconstrained from crossing state boundaries).
In addition, the underlying population data used to define commonly used geographic markets are out of date. The latest official commuting zone boundaries are derived from commuting patterns ascertained in the 2000 Census, though researchers have [updated](https://sites.psu.edu/psucz/data/) these boundaries based on 2010 data. HRRs and HSAs, by comparison, are defined by patient flows to hospitals in 1992 and 1993.
Clearly, flows of patients and commuters have changed substantially in many areas in the last 20-30 years. Whether these changes are material to defining geographic boundaries of contemporary health care markets remains an open question we will explore here.
Finally, it is worth mentioning that regulatory and antitrust reviews have drawn on a diverse set of additional market geography definitions. The history, use and controversies surrounding these definitions are nicely covered in the Department of Justice chapter entitled ["Competition Law: hospitals."](https://www.justice.gov/atr/chapter-4-competition-law-hospitals)
These alternative DOJ market definitions tend to rely on rich data on prices in health care markets. While in theory such information could be obtained nationwide, in practice the construction of market definitions using price data is contingent on the painstaking collection of local data from relevant market participants. I do not profess to have the human capital or funding resources to undertake such an exercise here. So we will focus on more general market geography definitions that can more easily scale--particularly using publicly available and relatively low-cost data.
# Geographic Market Definitions
Before moving on it is useful to put down, in one place, common methods and definitions used to define geographic markets.
## Hospital Service Areas (HSA)
HSAs are defined by the hospital care patterns of fee-for-service Medicare beneficiaries. Specifically, a three-step process is used:
1. Define all general acute care hospitals in the U.S. The town or city of the hospital location becomes the basis for HSA naming. Thus, if a given town has more than one hospital, those hospitals would be considered as part of the same HSA. In practice most HSAs end up with one hospital, however.
2. Aggregate all Medicare visits to the hospital (or hospitals, in cases where towns or cities have > 1 hospital). Using a plurality rule, assign each ZIP code to the HSA where the largest share of its residents receive hospital care.
3. Curate the HSA assignments to assure that only contiguous ZIPs make up the HSA.
In total there are 3,436 HSAs in the United States.
According to the Dartmouth methods appendix, data from 1992-93 were used to construct HSA boundaries. However, crosswalks from ZIP code tabulation area (ZCTA) to HSA are available on the Dartmouth website through 2017. Since 3,436 unique HSAs appear in the latest (2017) crosswalk this suggests that the updates only pertain to ZCTA updates, rather than updates on the geographies of the underlying HSAs.
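The plurality rule in step 2 can be sketched in a few lines of base R. The ZIP codes, town names, and counts below are hypothetical, and the sketch does not attempt the contiguity curation in step 3; it is an illustration of the rule rather than the Dartmouth implementation.

```r
# Hypothetical Medicare discharge counts by ZIP code and hospital town (HSA).
flows <- data.frame(
  zip   = c("00001", "00001", "00002", "00002", "00003"),
  hsa   = c("Town A", "Town B", "Town A", "Town B", "Town B"),
  cases = c(60, 40, 10, 30, 25),
  stringsAsFactors = FALSE
)

# Total cases for every ZIP-by-HSA pair.
zip_hsa <- aggregate(cases ~ zip + hsa, data = flows, FUN = sum)

# Plurality rule: each ZIP is assigned to the HSA with the most of its cases.
# Ties are broken by row order here, an arbitrary choice made for the sketch.
assign_hsa <- function(d) d[which.max(d$cases), c("zip", "hsa")]
zip_assignment <- do.call(rbind, lapply(split(zip_hsa, zip_hsa$zip), assign_hsa))
rownames(zip_assignment) <- NULL

zip_assignment
```

In this toy example ZIP 00001 is assigned to Town A even though 40 percent of its cases go to Town B, which illustrates how a plurality rule can mask substantial cross-boundary flows.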
## Hospital Referral Regions (HRR)
Whereas HSAs are intended to capture the geographic catchment area where residents of a ZIP code receive most of their overall hospital services, HRRs are meant to capture larger tertiary referral areas.
To identify HRRs, Dartmouth researchers aggregated HSAs into contiguous geographies based on where residents of the HSA received the most cardiovascular procedures and neurosurgeries. Thus, HSAs serve as the basic building block of HRRs. HRRs are also constructed to meet the following criteria:
- Population of at least 120,000.
- At least 65% of residents' services occurred within the region.
- Comprised of geographically contiguous HSAs.
In cases where the above criteria were not met, neighboring areas were pooled together until all criteria were satisfied. There are 306 HRRs in the United States.
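As a rough illustration of how the first two criteria might be checked, the sketch below flags candidate regions that would need to be pooled with a neighbor. The regions and counts are hypothetical, the column names are my own, and contiguity is not checked.

```r
# Hypothetical candidate regions assembled from HSAs.
regions <- data.frame(
  region          = c("R1", "R2", "R3"),
  population      = c(450000, 95000, 210000),
  cases_total     = c(5000, 900, 2400),  # procedures received by residents
  cases_in_region = c(4100, 700, 1300)   # of which performed inside the region
)

# Localization: share of residents' services that occurred within the region.
regions$localization <- regions$cases_in_region / regions$cases_total

# A region stands on its own only if it meets both thresholds.
regions$meets_criteria <-
  regions$population >= 120000 & regions$localization >= 0.65

regions[, c("region", "population", "localization", "meets_criteria")]
```

Here R2 fails on population and R3 fails on localization, so under the pooling rule described above both would be merged with neighboring areas.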
## Primary Care Service Areas
PCSAs are intended to serve as the analogue of HSAs for primary care services. Thus, a PCSA is defined as a collection of contiguous ZIP codes with at least one primary care provider, and where the plurality of primary care services is obtained among fee-for-service Medicare beneficiaries.
There are 6,542 PCSAs in the U.S. -- or roughly double the number of HSAs. On average there are 4.9 ZCTAs in a PCSA (median = 3, max = 81, min = 1). 61% of primary care services, on average, are obtained within PCSAs.
## Marketplace Rating Areas
Marketplace rating areas are geographically contiguous areas used for the purposes of insurance plan rate setting in the non-group market. The default geography used to set rating areas is the Metropolitan Statistical Area (MSA) plus the remainder of the state not in an MSA (MSA+1 definition). However, states have the option to define alternative county, 3-digit ZIP, or MSA/non-MSA clusters if they deem some alternative definition more important for regulation and rate setting within the state.
In practice, only 7 states (AL, NM, ND, OK, TX, VA, WY) went along with the default (MSA+1) standard. The vast majority of the states submitted clusterings of counties as their rating area definitions. Another handful of states (MA, NE, AK) uses clusters of 3-digit ZIP codes, while CA uses a combination of counties and 3-digit ZIPs. Specifically, LA county is split into two rating areas based on 3-digit ZIP, while the remainder of the state is apportioned into rating areas based on county boundaries.
What this means is there is significant heterogeneity across states in the geographic and population size of rating areas. South Carolina, for example, has 46 rating areas -- more than *double* the 19 rating areas that define California!
## Commuting Zones
Commuting zones are comprised of geographically contiguous counties with strong within-area clustering of commuting ties between residential and work county, and weak across-area ties. The [latest official commuting zone geography files](https://www.ers.usda.gov/data-products/commuting-zones-and-labor-market-areas/) are based on patterns observed in the 2000 Census. However, more recent county-to-county commuting data are available based on the 2009-2013 American Community Survey (ACS) and could be used to construct new commuting zone geographies.
For now, the zones used here will draw on the shapefiles constructed based on 2010 Census data by researchers at [Penn State](https://sites.psu.edu/psucz/).
This description of the history and methods of commuting zones from the U.S. Department of Agriculture (USDA) is useful:
> The ERS Commuting Zones (CZs) and Labor Market Areas (LMAs) were first developed in the 1980s as ways to better delineate local economies. County boundaries are not always adequate confines for a local economy and often reflect political boundaries rather than an area's local economy. CZs and LMAs are geographic units of analysis intended to more closely reflect the local economy where people live and work. Beginning in 1980 and continuing through 2000, hierarchical cluster analysis was used along with the Census Bureau's journey to work data to group counties into these areas. In 2000, there were 709 CZs delineated for the U.S., 741 in 1990, and 768 in 1980. LMAs are similar to CZs except that they had to have a minimum population of 100,000 persons. LMAs were only estimated in 1980 and 1990. This was done in order for the Census Bureau to create microdata samples using decennial census data (1980 PUMS-D, 1990 PUMS-L) that avoided disclosure. The LMAs were discontinued in 2000 because researchers found them to be too large and not as useful as the CZs. The identical methodology was used to develop CZs for all three decades.
# Market Definitions Based on Community Detection
Measures of economic activity (e.g., patient flows, predicted demand, prices) among market participants form the essential building blocks for defining health care markets. These linkages can be combined into a "network" summarizing the strength of economic linkages between relevant market units--for example, linkages between individuals / ZIP codes and their health care providers. This network, moreover, can be used to identify commonalities *within* market units. For instance, for defining geographic markets we might be interested in identifying clusters of ZIP codes that draw upon a common set of hospitals. Alternatively, we may be interested in clustering *hospitals* into groups based on economic ties among them (e.g., markets of "competing" hospitals that draw patients from similar ZIP codes).
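A small toy example may help fix ideas. The sketch below starts from a hypothetical ZIP-by-hospital incidence matrix (the ZIP and hospital labels are invented) and uses matrix multiplication to project it into a ZIP-by-ZIP network and a hospital-by-hospital network.

```r
# Hypothetical incidence matrix: 1 if the ZIP sends a meaningful share of its
# patients to the hospital, 0 otherwise.
bp <- matrix(
  c(1, 1, 0,
    1, 1, 0,
    0, 1, 1,
    0, 0, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("zip_a", "zip_b", "zip_c", "zip_d"),
                  c("hosp_1", "hosp_2", "hosp_3"))
)

# ZIP-by-ZIP projection: entry [i, j] counts hospitals shared by ZIPs i and j.
zip_by_zip <- bp %*% t(bp)

# Hospital-by-hospital projection: entry [i, j] counts ZIPs shared by
# hospitals i and j.
hosp_by_hosp <- t(bp) %*% bp

zip_by_zip
hosp_by_hosp
```

In this toy matrix `zip_a` and `zip_b` share two hospitals while `zip_a` and `zip_d` share none, so the first pair is a natural candidate for the same geographic market. The diagonal simply counts each unit's own connections and is usually dropped before clustering.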
Identification of clusters of linked "nodes" in a network is known as **community detection.** A variety of community detection algorithms have been developed across diverse fields, including physics, biology, and sociology. As we discuss below, many common approaches to defining health care markets--including HSAs, HRRs, commuting zones, and even the hypothetical monopolist test used in antitrust reviews--are essentially community detection methods.
In this section, we outline an [ensemble](https://arxiv.org/abs/1309.0242) network-analytic approach to defining hospital markets based on community detection. This approach fits several different algorithms and then aggregates the information they produce to improve market definitions.
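To give a flavor of what "fit several algorithms and aggregate" can look like, the sketch below runs three `igraph` community detection algorithms on a toy network and averages their co-assignment matrices. This simple averaging is an assumption made for illustration; it is not necessarily the aggregation scheme in the linked paper, and the toy graph is invented.

```r
library(igraph)

# Toy undirected network: two tightly knit groups (nodes 1-4 and 5-8)
# joined by a single bridging edge between nodes 4 and 5.
g <- make_full_graph(4) %du% make_full_graph(4)
g <- add_edges(g, c(4, 5))

# Fit several community detection algorithms to the same graph.
set.seed(1)
fits <- list(
  walktrap    = cluster_walktrap(g),
  fast_greedy = cluster_fast_greedy(g),
  louvain     = cluster_louvain(g)
)

# For each algorithm, build a 0/1 matrix marking node pairs placed together.
co_assigned <- lapply(fits, function(f) {
  m <- as.integer(membership(f))
  outer(m, m, "==") * 1
})

# Consensus: share of algorithms that put nodes i and j in the same community.
consensus <- Reduce(`+`, co_assigned) / length(co_assigned)

round(consensus, 2)
```

Pairs with a consensus value near 1 are grouped together by every algorithm, while intermediate values flag nodes whose market assignment is sensitive to the choice of algorithm.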
We focus on a single example of ZIP codes and hospitals within Philadelphia County, Pennsylvania. As we will show later, with appropriate data the methodology is easily scalable nationally and to other geographies and health care service types (e.g., physicians, insurers). An interesting path for future work might also look at how markets differ by patient sub-populations defined by disease/condition, service use type (e.g., emergency care vs. elective surgery), income, or insurance type.
## Some Caveats and Ideas for Future Research
It is important to emphasize that the approach we articulate is not intended as a substitute for market definitions guided by economic theory. Rather, we provide a novel *analytic framework* for defining markets based on data summarizing economic links among market participants. In other words, just as the multinomial logit provides an analytic framework for estimating consumer demand, our network analytic approach provides an analytic framework for detecting markets. As noted above, a nice feature of the approach is that common methods for defining markets (e.g., HSA, the hypothetical monopolist test, etc.) can all be seen as special cases of community detection.
Once understood in that sense, it is straightforward to see that our framework easily accommodates measures of economic linkages motivated by theory, produced via estimation of (exogenous) demand, or both. In our examples below, these "linkages" are summarized as patient counts of the number of fee-for-service Medicare patients from each ZIP code who are treated at local hospitals. We recognize that these linkages are observational and subject to endogeneity concerns, which will affect the market definitions identified in the example. However, it is important to recognize that any economic "linkage" measure (e.g., patient demand estimated using exogenous variation, unit price correlations among competing hospitals) could be plugged in as the relevant measure of an economic connection among market units.
Finally, as noted above and in a similar vein, our data on patient flows is drawn from [publicly available](https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/Hospital-Service-Area-File/index.html) hospital service data from CMS. These data summarize *overall* patient flows to acute care hospitals among fee-for-service Medicare patients. But in principle, data on other patient populations--defined by service type (e.g., emergency patients, cardiovascular patients, etc.) or population (e.g., low-income patients, commercially-insured patients) could be used. That is, data on patient sub-populations could be used to partition geographies or hospitals into different markets. Future research on how market definitions vary for different patient populations would be quite useful.
```{r, echo = FALSE}
# St. Joseph's Hospital in Philadelphia closed March 11, 2016. North Philadelphia Health System closed the 146-bed hospital as it consolidated operations to help improve its finances. North Philadelphia Health System filed for bankruptcy Dec. 30.
north_phila <- "390132"
north_phila_zip <- "19130"
zip1 <-19116
zip2 <- 19130
```
```{r, echo = FALSE}
library(ggrepel)
# Load shape files
sf_state <- read_sf(here("output/tidy-mapping-files/state/01_state-shape-file.shp")) %>%
st_transform(crs = 4326)
sf_county <- read_sf(here("output/tidy-mapping-files/county/01_county-shape-file.shp")) %>%
st_transform(crs = 4326)
sf_zip <- read_sf(here("output/tidy-mapping-files/zcta/01_zcta-shape-file.shp")) %>%
st_transform(crs = 4326)
# Centroids of each ZIP
cent_zip <- sf_zip %>% st_centroid()
# Just the example county (Philadelphia County, PA; FIPS 42101)
sf_excounty_alone <-
sf_county %>%
filter(fips_code=="42101")
# ZIPs with centroids within the example county (Philadelphia County, PA)
zips_to_show <- cent_zip %>%
filter(row_number() %in% unlist(st_intersects(sf_excounty_alone, cent_zip))) %>%
pull(zcta5ce10)
# Shapefile of these ZIPS
sf_zip_excounty <- sf_zip %>%
filter(zcta5ce10 %in% zips_to_show)
# Centroids of these ZIPs
cent_zip_excounty <-
sf_zip_excounty %>%
st_centroid()
# Get the X,Y coordinates of the centroids (bind as extra columns)
cent_zip_excounty <-
bind_cols(cent_zip_excounty, cent_zip_excounty %>% st_coordinates() %>% as_tibble() %>% set_names(c("x","y"))) %>%
ungroup() %>% data.frame() %>% as_tibble()
```
```{r, cache = TRUE, echo = FALSE}
# Get the hospital x,y coordinates from the AHA Data.
aha_files <- c("2017" = "../Research-AHA_Data/data/aha/annual/raw/2017/FY2017 ASDB/COMMA/ASPUB17.CSV",
"2016" = "../Research-AHA_Data/data/aha/annual/raw/2016/FY2016 Annual Survey Database/COMMA/ASPUB16.CSV",
"2015" = "../Research-AHA_Data/data/aha/annual/raw/2015/FY2015 Annual Survey Database/COMMA/ASPUB15.CSV"
)
# Get latitude and longitude of general acute care hospitals from each year's AHA survey (2015-2017).
aha <-
aha_files %>%
map(~(
data.table::fread(here(.x)) %>%
janitor::clean_names() %>%
filter(mstate %in% states) %>%
filter(mstate !="AK" & mstate!="HI") %>%
mutate(system_id = ifelse(!is.na(sysid),paste0("SYS_",sysid),id)) %>%
filter(serv==10))) %>%
map(~rename_in_list(x = .x, from = "hcfaid", to = "mcrnum")) %>%
map(~(.x %>%
select(mname, id, mcrnum , latitude = lat, longitude = long, hrrnum = hrrcode, hsanum = hsacode, admtot, system_id, mloczip, sysname,fips_code=fcounty,mloccity ) %>%
mutate(hrrnum = paste0(hrrnum)) %>%
mutate(hsanum = paste0(hsanum)) %>%
mutate(prvnumgrp = str_pad(mcrnum,width = 6, pad="0")) %>%
mutate(hosp_zip_code = str_sub(mloczip,1,5)) %>%
mutate(longitude = as.numeric(paste0(longitude))) %>%
mutate(latitude = as.numeric(paste0(latitude))) %>%
filter(!is.na(longitude) & !is.na(latitude))
)) %>%
set_names(names(aha_files)) #%>%
aha %>% write_rds(here("output/aha/aha-extract.rds"))
# Shape file of x,y coordinates of all hospitals
xy_aha <- aha[["2016"]] %>%
st_as_sf(coords = c("longitude", "latitude"), crs = 4326)
```
```{r, echo = FALSE}
minimum_market_share_zip <- 0.10
minimum_market_share_hosp <- 0.10
# Load the hospital-ZIP patient sharing file
df_hosp_zip <- read_rds(here("output/hospital-county-patient-data/2015/hospital-zip-patient-data.rds")) %>%
group_by(prvnumgrp) %>%
mutate(market_share_hosp = total_cases/ sum(total_cases,na.rm=TRUE)) %>%
filter(zip_code %in% zips_to_show) %>%
group_by(zip_code) %>%
mutate(market_share_zip = total_cases / sum(total_cases,na.rm =TRUE)) %>%
filter(market_share_zip>minimum_market_share_zip | market_share_hosp>minimum_market_share_hosp ) %>%
inner_join(aha[["2015"]],"prvnumgrp") %>%
rename(hosp_x = longitude,
hosp_y = latitude) %>%
inner_join(cent_zip_excounty %>% select(zip_code = zcta5ce10 , zip_x = x, zip_y = y), "zip_code") %>%
group_by(zip_code) %>%
mutate(market_share = total_cases / sum(total_cases,na.rm=TRUE)) %>%
ungroup()
# Summarize hospital market shares from these ZIPs
df_hosp_summ <-
df_hosp_zip %>%
group_by(prvnumgrp) %>%
summarise(total_cases = sum(total_cases)) %>%
mutate(total_cases_orig = total_cases) %>%
mutate(total_cases = scales::rescale(total_cases,to = c(0,1)))
# Summarise hospital market shares for specific ZIPs to use as examples
df_hosp_zip1 <-
df_hosp_zip %>%
#filter(zip_code == zip1) %>%
filter(zip_code==zip1) %>%
group_by(prvnumgrp) %>%
summarise(total_cases_zip1 = sum(total_cases)) %>%
mutate(total_cases_zip1 = scales::rescale(total_cases_zip1,to = c(0,1)))
df_hosp_zip2 <-
df_hosp_zip %>%
#filter(zip_code == zip2) %>%
filter(zip_code==zip2) %>%
group_by(prvnumgrp) %>%
summarise(total_cases_zip2 = sum(total_cases)) %>%
mutate(total_cases_zip2 = scales::rescale(total_cases_zip2,to = c(0,1)))
```
```{r, echo = FALSE}
# Find the nearby hospitals in the AHA data with X,Y coordinates.
hospitals_to_show <- xy_aha %>%
filter(prvnumgrp %in% unique(df_hosp_zip$prvnumgrp))
xy_hosp <- bind_cols(xy_aha %>% filter(prvnumgrp %in% unique(df_hosp_zip$prvnumgrp)) %>% ungroup() %>% data.frame(),
xy_aha %>%
filter(prvnumgrp %in% unique(df_hosp_zip$prvnumgrp)) %>%
st_coordinates() %>% as_tibble() %>% set_names(c('x','y'))) %>%
left_join(df_hosp_summ,"prvnumgrp") %>%
left_join(df_hosp_zip1,"prvnumgrp") %>%
left_join(df_hosp_zip2,"prvnumgrp")
```
## ZIP Codes and Hospitals in Philadelphia County, PA
We begin by plotting the ZIP codes with geographic centroids contained within Philadelphia County, PA. In addition, we plot the location and market share (point size) of all hospitals based on the treatment patterns observed among traditional Medicare patients in 2015. For a ZIP-hospital pair to be included, the hospital must have treated at least `r 100*minimum_market_share_zip` percent of the ZIP's patients, or the ZIP must account for at least `r 100*minimum_market_share_hosp` percent of the hospital's total FFS Medicare patients. These cutoff thresholds are not required for the approach, but they aid visualization in this example because they trim the data to avoid plotting the hundreds of hospitals that treated only one or two patients from the example ZIP codes.
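The inclusion rule can be illustrated with a small base R sketch. The counts are hypothetical, and for simplicity each hospital's total is taken over the rows shown rather than over all of its FFS Medicare patients.

```r
# Hypothetical FFS Medicare case counts by ZIP code and hospital.
flows <- data.frame(
  zip_code    = c("Z1", "Z1", "Z1", "Z2", "Z2"),
  prvnumgrp   = c("H1", "H2", "H3", "H1", "H3"),
  total_cases = c(90, 8, 2, 5, 95)
)
min_share_zip  <- 0.10
min_share_hosp <- 0.10

# Share of each ZIP's patients treated at the hospital.
flows$market_share_zip <-
  flows$total_cases / ave(flows$total_cases, flows$zip_code, FUN = sum)

# Share of each hospital's patients who come from the ZIP.
flows$market_share_hosp <-
  flows$total_cases / ave(flows$total_cases, flows$prvnumgrp, FUN = sum)

# Keep a ZIP-hospital pair if it clears either threshold.
kept <- flows[flows$market_share_zip > min_share_zip |
                flows$market_share_hosp > min_share_hosp, ]

kept
```

Note that the Z1-H2 pair is retained even though H2 treats only 8 percent of Z1's patients, because Z1 supplies all of H2's cases in this toy data. The "or" in the rule is what keeps small hospitals that depend heavily on a single ZIP code.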
Also note that the hospital plotted in red is St. Joseph's Hospital, a 146-bed hospital that [closed on March 11, 2016](https://philadelphia.cbslocal.com/2015/12/29/st-josephs-hospital-in-north-philadelphia-to-close-in-march-2016/). We will (eventually) use this closing in an event study to validate the market classifications--the idea being that hospitals within the market containing St. Joseph's should have been more affected by its closing (i.e., their total FFS Medicare patient volume goes up) than hospitals outside it.
```{r, fig.cap = "Hospitals and ZIP Codes in Philadelphia County, PA, 2015",fig.align = "center",fig.pos="H", fig.height = 4, fig.width = 4, echo = FALSE}
# Map the hospitals
sf_zip_excounty %>%
ggplot() + geom_sf() +
remove_all_axes +
geom_sf(data = sf_excounty_alone, colour = "black",lwd = 1, alpha = 0.1) +
coord_sf(datum=NA) +
geom_text_repel(data = xy_hosp, aes(x=x,y=y,label = mname),cex = 1.5) +
geom_point(data = xy_hosp, aes(x=x,y=y,size = total_cases)) +
theme(legend.position = "none") +
geom_text_repel(data = cent_zip_excounty, aes(x=x,y=y,label = zcta5ce10 ), cex = 1) +
geom_point(data = xy_hosp %>% filter(mname=="North Philadelphia Health System"),color="red",aes(x=x,y=y))
```
```{r, echo = FALSE}
# Get a dataset of edges (i.e., share of patients going to each hospital from each ZIP)
df_edges <- df_hosp_zip %>%
select(zip_code,prvnumgrp,market_share,hosp_x,hosp_y,zip_x,zip_y) %>%
gather(type,value,-zip_code,-prvnumgrp,-market_share) %>%
separate(type,into=c("type","coord")) %>%
spread(coord,value) %>%
unite(id,prvnumgrp,zip_code)
df_edges_w <- df_edges %>%
gather(key,value,-id,-market_share,-type) %>%
unite(key,type,key) %>%
spread(key,value)
df_nodes <- df_hosp_zip %>%
select(zip_code,prvnumgrp,hosp_x,hosp_y,zip_x,zip_y) %>%
unique()
```
In the next plot we show the care use patterns among FFS Medicare patients who reside in two ZIP codes: `r paste0(zip1)` and `r paste0(zip2)`. Patient flows from the ZIP code to area hospitals are represented by the "edge line" linking the ZIP centroid to the geographic location of each hospital.
As seen in the figure, patients from the two example ZIP codes are treated at fundamentally different hospitals. These hospitals, moreover, are within close geographic proximity to each ZIP code. Finally, it is worth noting that many patients from ZIP `r paste0(zip1)` are observed to travel to an out-of-county hospital.
```{r, fig.cap = "Hospitals Utilized Among FFS Medicare Beneficiaries from Select ZIP Codes", fig.height = 4, fig.width = 4,fig.pos="H", echo = FALSE}
sf_zip_excounty %>%
ggplot() + geom_sf() +
remove_all_axes +
geom_sf(data = sf_excounty_alone, colour = "black",lwd = 1, alpha = 0.1) +
geom_sf(data = sf_zip_excounty %>% filter(zcta5ce10 %in% c(zip1,zip2)), aes(fill = zcta5ce10),alpha = 0.5) +
coord_sf(datum=NA) +
geom_text_repel(data = xy_hosp, aes(x=x,y=y,label = mname),cex = 1.5) +
geom_point(data = xy_hosp, aes(x=x,y=y,size = total_cases_zip1),colour = "darkblue") +
geom_point(data = xy_hosp, aes(x=x,y=y,size = total_cases_zip2),colour = "darkblue") +
theme(legend.position = "none") +
geom_text_repel(data = cent_zip_excounty, aes(x=x,y=y,label = zcta5ce10 ), cex = 1) +
#geom_path(data = df_edges_w %>% filter(grepl(zip1,id)), aes(x=zip_x,y = zip_y,xend=hosp_x,yend=hosp_y)) +
geom_line(data = df_edges %>% filter(grepl(zip1,id)|grepl(zip2,id)),aes(x=x,y=y,group=id),size = 1) +
scale_size_continuous(range=c(1,10),limits = c(1,10))
```
The next figure plots "edge links" among all ZIPs and hospitals in Philadelphia County. The plotted line width is also proportional to the total volume of patients. That is, a thin line connecting a ZIP-hospital pair indicates that only a small fraction of patients from the ZIP code are treated at that particular hospital.
While things will become more clear in a later plot, a rough sense of distinct markets for hospital services can be seen in this simple map-based visualization of patient flows. For example, patients residing in the ZIP codes clustered in the southwest corner of the county all flow into hospitals located there, and there are few "shared" connections among these hospitals with other ZIP codes in the county.
```{r, fig.cap = "Patient Flows Among ZIP Codes and Hospitals in Philadelphia County, PA, 2015",fig.pos="H", echo = FALSE}
sf_zip_excounty %>%
ggplot() + geom_sf() +
remove_all_axes +
geom_sf(data = sf_excounty_alone, colour = "black",lwd = 1, alpha = 0.1) +
coord_sf(datum=NA) +
geom_text_repel(data = xy_hosp, aes(x=x,y=y,label = mname),cex = 1.5) +
theme(legend.position = "none") +
geom_text_repel(data = cent_zip_excounty, aes(x=x,y=y,label = zcta5ce10 ), cex = 1) +
geom_line(data = df_edges ,aes(x=x,y=y,group=id,size=market_share),colour="darkblue") +
#geom_path(data = df_edges_w, aes(xend=zip_x,yend=zip_y,x=hosp_x,y=hosp_y,size=market_share),arrow = arrow(length = unit(0.01, "npc")) )+
geom_point(data = xy_hosp, aes(x=x,y=y,size = total_cases)) +
scale_size_continuous(range=c(0,2))
```
The next plot removes the geographic location layering and simply plots the bipartite network. That is, we no longer tether each hospital and ZIP to its geographic location and centroid, respectively. Rather, we utilize a large graph layout (LGL) algorithm to improve the visualization of ties between ZIP codes and hospitals. As in the map in the figure above, the strength of ties between ZIP codes and hospitals is represented by the width of the line.
```{r, echo = FALSE}
# Construct Network objects
library(ggraph)
library(tidygraph)
# Create function to convert dataframe to bipartite matrix
convert_to_bipartite <- function(df,id) {
id <- enquo(id)
nn <- df %>% pull(!!id)
foo <- df %>% select(-!!id) %>%
as.matrix()
rownames(foo) <- nn
foo
}
bp_zip_hosp_weighted <-
df_hosp_zip %>%
filter(prvnumgrp %in% hospitals_to_show$prvnumgrp & zip_code %in% zips_to_show) %>%
group_by(zip_code) %>%
mutate(share_of_patients = total_cases / sum(total_cases, na.rm = TRUE)) %>%
ungroup() %>%
# Two ZIPs are connected if at least XX% of their total patients go to the same hospital
mutate(connected = as.integer(share_of_patients >= minimum_market_share_zip)) %>%
mutate(share = ifelse(connected==1,share_of_patients,0)) %>%
select(zip_code, prvnumgrp, total_cases) %>%
spread(prvnumgrp, total_cases) %>%
convert_to_bipartite(id = zip_code)
bp_zip_hosp_weighted[is.na(bp_zip_hosp_weighted)] <- 0
net_bp <-
graph_from_incidence_matrix(bp_zip_hosp_weighted,weighted=TRUE) %>%
simplify(., remove.loops=TRUE) %>%
as_tbl_graph() %>%
activate(nodes) %>%
left_join(xy_hosp %>% ungroup() %>%
as.data.frame() %>%
select(name = prvnumgrp,mname),"name") %>%
mutate(label = ifelse(!is.na(mname),mname,name))
```
```{r, fig.cap = "Visualization of ZIP-Hospital Patient Flows as a Bipartite Network Object", fig.width = 8, fig.height=8,fig.pos="H", echo = FALSE}
set.seed(2)
net_bp %>%
ggraph(layout='lgl') +
#scale_x_continuous(limits = c(-15,10)) +
#scale_y_continuous(limits = c(-20,6)) +
geom_edge_link(aes(width = weight,alpha=weight),show.legend = FALSE) +
geom_node_point(aes(colour=type), cex = 6) +
geom_node_text(aes(label = label),cex = 4,repel=TRUE) +
remove_all_axes +
theme(legend.position = "none") +
theme(text = element_text(size = 20))
```
The underlying market structure of Philadelphia-area hospitals starts to become a bit clearer in this representation of the data. Again, we see that ZIP codes tend to cluster around certain sets of hospitals. For example, Holy Redeemer, Nazareth, Aria-Jefferson and Jeanes Hospital all tend to draw on patients from similar ZIP codes. By comparison, Mercy Fitzgerald, the UPenn hospitals, and Lankenau Medical Center draw patients from a different cluster of ZIPs.
```{r, echo = FALSE}
#https://arxiv.org/pdf/1309.0242.pdf
bp_zip_hosp <-
df_hosp_zip %>%
filter(prvnumgrp %in% hospitals_to_show$prvnumgrp & zip_code %in% zips_to_show) %>%
group_by(zip_code) %>%
mutate(share_of_patients = total_cases / sum(total_cases, na.rm = TRUE)) %>%
ungroup() %>%
mutate(connected = as.integer(share_of_patients >= minimum_market_share_zip)) %>%
mutate(share = ifelse(connected==1,share_of_patients,0)) %>%
select(zip_code, prvnumgrp, connected) %>%
spread(prvnumgrp, connected) %>%
convert_to_bipartite(id = zip_code)
bp_zip_hosp[is.na(bp_zip_hosp)] <- 0
# bp_zip_hosp[1:10,1:10]
up_zip <-bp_zip_hosp %*% t(bp_zip_hosp)
# Get a weighted version (i.e., continuous fraction of patients, rather than using threshold)
up_zip_tmp <- bp_zip_hosp_weighted %*% t(bp_zip_hosp_weighted>0)
up_zip_w <- up_zip_tmp / diag(up_zip_tmp)
# This just uses the binary "connected"
net_ex <-
graph_from_adjacency_matrix(up_zip, weighted = TRUE,mode="undirected") %>%
simplify(., remove.loops = TRUE)
# This uses the fraction shared
net_ex_weighted <-
graph_from_adjacency_matrix(up_zip_w, weighted = TRUE,mode="undirected") %>%
simplify(., remove.loops = TRUE)
get_community <- function(gg,method,...) {
if (method == "walktrap") {
initial_communities <-
walktrap.community(gg,...)
} else if (method == "fast_greedy") {
initial_communities =
cluster_fast_greedy(gg,...)
} else if (method =="multilevel") {
initial_communities <- multilevel.community(gg,...)
} else if (method == "spinglass") {
initial_communities <- spinglass.community(gg,...)
} else if (method == "infomap") {
initial_communities <- infomap.community(gg,...)
} else if (method == "louvain") {
initial_communities <- cluster_louvain(gg, weights = NULL)
} else if (method == "edge_between") {
initial_communities <- edge.betweenness.community(gg,...)
} else if (method == "eigen") {
initial_communities <- leading.eigenvector.community(gg,...)
} else if (method == "label_prop") {
initial_communities <- label.propagation.community(gg,...)
}
return(list(fit = initial_communities, market = membership(initial_communities)))
}
market_cw <- net_ex_weighted %>% get_community(method = "walktrap",
steps = 5,
merges = TRUE,
modularity = TRUE,
membership = TRUE)
market_fg <- net_ex_weighted %>% get_community(method = "fast_greedy")
market_ml <- net_ex_weighted %>% get_community(method = "multilevel")
market_sg <- net_ex_weighted %>% get_community(method = "spinglass")
market_info <- net_ex_weighted %>% get_community(method = "infomap")
market_louv <- net_ex_weighted %>% get_community(method = "louvain")
market_edge <- net_ex_weighted %>% get_community(method = "edge_between")
market_eigen <- net_ex_weighted %>% get_community(method = "eigen")
market_label <- net_ex_weighted %>% get_community(method = "label_prop")
gf_ex <- tidygraph::as_tbl_graph(net_ex,directed=FALSE) %>%
activate(nodes) %>%
mutate(zip_code = name) %>%
mutate(market_cw = factor(market_cw$market[name])) %>%
mutate(market_fg = factor(market_fg$market[name])) %>%
mutate(market_ml = factor(market_ml$market[name])) %>%
mutate(market_sg = factor(market_sg$market[name])) %>%
mutate(market_info = factor(market_info$market[name])) %>%
mutate(market_louv = factor(market_louv$market[name])) %>%
mutate(market_edge = factor(market_edge$market[name])) %>%
mutate(market_eigen = factor(market_eigen$market[name])) %>%
mutate(market_label = factor(market_label$market[name]))
```
Next we will take this bipartite matrix and transform it into a unipartite matrix summarizing the total number of shared hospital connections between ZIP codes. Here, we will define two ZIPs as "connected" if, for one or more hospitals, at least `r 100*minimum_market_share_zip` percent of the patients from each ZIP are treated at that hospital. Thus, if 15% of patients from ZIP A and 25% of patients from ZIP B go to hospital 1, those two ZIPs would be connected. However, if just 1% of patients from ZIP A and 25% of patients from ZIP B go to the hospital, those ZIPs would not be counted as connected.
The plot below provides a visualization of the unipartite network of ZIP codes in Philadelphia county. Again, we can see clear "clustering" of ZIPs: groups of ZIPs that tend to draw on the same hospitals. In the plot, the weight of the edge lines linking two ZIP codes is proportional to the total number of "connected" hospitals those two ZIPs have in common.
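To make the thresholding and projection concrete, here is a minimal sketch on made-up data. The ZIP and hospital names, the patient counts, and the 10% threshold are all hypothetical; the actual threshold is whatever `minimum_market_share_zip` is set to.

```r
# Hypothetical patient counts: rows are ZIPs, columns are hospitals.
cases <- matrix(
  c(15, 85,  0,
    25,  0, 75,
     1, 99,  0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("zip_A", "zip_B", "zip_C"),
                  c("hosp_1", "hosp_2", "hosp_3"))
)

threshold <- 0.10  # assumed value, standing in for minimum_market_share_zip

# Share of each ZIP's patients treated at each hospital (rows sum to 1).
shares <- cases / rowSums(cases)

# Binary bipartite matrix: 1 if the hospital meets the threshold for that ZIP.
connected <- (shares >= threshold) * 1

# Unipartite projection: entry [i, j] counts the hospitals that meet the
# threshold for both ZIP i and ZIP j.
zip_by_zip <- connected %*% t(connected)
zip_by_zip
```

In this toy example `zip_A` and `zip_B` are linked through `hosp_1`, while `zip_B` and `zip_C` are not, because only 1% of `zip_C`'s patients go to `hosp_1`.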
```{r, fig.cap = "Visualization of ZIP-Hospital Connections as a Unipartite Network of ZIP Codes", fig.height =5, fig.width = 5,fig.pos="H", echo = FALSE}
set.seed(234)
gf_ex %>%
activate(edges) %>%
filter(weight>0) %>%
ggraph(layout='fr') +
geom_edge_link(aes(width = weight, alpha = weight), show.legend = FALSE) +
#geom_node_point(aes(colour=market_cw),size = 6) +
geom_node_point(size=6,colour="lightblue")+
geom_node_text(aes(label = zip_code)) +
remove_all_axes +
theme(legend.position = "none")
```
## Identifying Markets via Community Detection
Our next step is to use this network representation of hospital use to "detect" markets for hospital services. A nice feature of this approach is that we can detect markets from two perspectives: the geography (i.e., what ZIP codes tend to send patients to similar hospitals?) or the hospital (i.e., what hospitals tend to draw patients from similar ZIP codes?).
We can see in the representation of Philadelphia that the ZIP code geographies tend to cluster around each other--that is, patients from clusters of geographically-proximate ZIP codes tend to use the same hospitals. There is a clear separation of certain clusters of ZIP codes, while other ZIP codes (e.g., 19134) straddle different hospital "communities."
Over the years a [variety of community detection algorithms](https://www.nature.com/articles/srep30750) have been developed. Each takes as its input a network, and assigns a community membership attribute to each node in the network. The algorithms are constructed such that the network is subdivided into mutually exclusive communities--though some [more recent work](https://dl.acm.org/citation.cfm?id=2501657) has allowed for nodes to be represented in more than one community.
We'll begin by deploying the Cluster Walktrap algorithm on the unipartite (ZIP-ZIP) network for Philadelphia county. The output from this algorithm has been added to the plot below.
```{r, fig.cap = "Visualization of Hospital Markets as Detected by the Cluster Walktrap Algorithm", fig.height =5, fig.width = 5,fig.pos="H", echo = FALSE}
set.seed(234)
gf_ex %>%
activate(edges) %>%
filter(weight>0) %>%
ggraph(layout='fr') +
geom_edge_link(aes(width = weight, alpha = weight), show.legend = FALSE) +
geom_node_point(aes(colour=market_cw),size = 6) +
geom_node_text(aes(label = zip_code)) +
remove_all_axes +
theme(legend.position = "none")
```
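As a minimal sketch of what a community detection call looks like, the example below runs the walktrap algorithm from `igraph` on a made-up six-node network. The node names and edges are hypothetical and stand in for the ZIP-ZIP network used in this document.

```r
library(igraph)

# Toy undirected network: two triangles joined by a single bridge (c -- d).
edges <- data.frame(
  from = c("a", "a", "b", "d", "d", "e", "c"),
  to   = c("b", "c", "c", "e", "f", "f", "d")
)
g <- graph_from_data_frame(edges, directed = FALSE)

# Input: a network. Output: a fitted communities object.
fit <- cluster_walktrap(g, steps = 4)

# Named vector of community labels, one entry per node.
m <- membership(fit)
m
```

The membership vector is indexed by node name, which is how the market labels are attached to each ZIP code in the plots.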
```{r, eval = FALSE}
library(ggforce)
net_bp2 <- net_bp %>%
activate(nodes) %>%
mutate(market = market_cw$market[name]) %>%
mutate(market = ifelse(is.na(market),0,market))
set.seed(2)
net_bp2 %>%
ggraph(layout='lgl') +
#scale_x_continuous(limits = c(-15,10)) +
#scale_y_continuous(limits = c(-20,6)) +
geom_edge_link(aes(width = weight,alpha=weight),show.legend = FALSE) +
geom_node_point(aes(colour=type), cex = 6) +
geom_node_text(aes(label = label),cex = 4,repel=TRUE) +
remove_all_axes +
theme(legend.position = "none") +
theme(text = element_text(size = 20)) -> p
foo <- net_bp2 %>% activate(nodes) %>% data.frame()
p_ <- cbind.data.frame(ggplot_build(p)$data[[2]],foo)
p + ggforce::geom_mark_ellipse(data = p_, aes(x=x,y =y, fill=factor(market),filter=market!=0),alpha=0.1,
linetype=0) +
ggsci::scale_color_aaas()
```
```{r, echo = FALSE}
# Fit Ensemble
get_up_market_matrix <- function(foo) {
foo2 <- foo %>% data.frame() %>%
mutate(connected = 1) %>%
spread(market,connected) %>%
convert_to_bipartite(id=zip_code)
foo2[is.na(foo2)] <- 0
foo2 %*% t(foo2)
}
up_market <-
gf_ex %>%
activate(nodes) %>%
data.frame() %>%
select(-name) %>%
gather(algorithm,market,-zip_code) %>%
group_by(algorithm) %>%
nest() %>%
mutate(bp =map(data,~(
get_up_market_matrix(.x)
)))
#https://arxiv.org/pdf/1309.0242.pdf , pp5
# "Firstly, a complete graph, G = (V, F), is constructed using the
# data from candidate communities, which are the output
# from some community detection algorithm(s). The set
# of nodes, V , is the original set of nodes in the network,
# and the set of edges F now indicate that two nodes have
# been found in the same community. The matrix, F =
# [Fij ], where the element in row i and column j, Fij is the
# frequency of the event that nodes i and j has been found
# in the same candidate community."
gf_F <- up_market$bp[[1]]
for (i in 2:length(up_market$bp)) {
gf_F <- gf_F + up_market$bp[[i]]
}
# Agglomerative Hierarchical Clustering
ensemble.clust <- hclust(dist(gf_F))
# Dendrogram of the Clustering
ensemble_dendro <- as.dendrogram(ensemble.clust)
# GGplot of the dendrogram
ensemble_dendro_plot <- ensemble_dendro %>% ggdendrogram(rotate = TRUE)
# Get the modularity at each height
ensemble_modularity_all <-
1:100 %>%
map_dbl(~(modularity(net_ex, cutree(ensemble.clust , h = .x))))
ensemble_modularity_max_which <- min(which(ensemble_modularity_all == max(ensemble_modularity_all)))
ensemble_modularity_max <- ensemble_modularity_all[ensemble_modularity_max_which]
ensemble_membership <- cutree(ensemble.clust , h = ensemble_modularity_max_which)
gf_ex <-
gf_ex %>%
activate(nodes) %>%
mutate(market_ensemble = factor(ensemble_membership[name]))
df_modularity <-
data.frame(ensemble = modularity(net_ex,ensemble_membership),
walktrap = modularity(net_ex,market_cw$market),
fast_greedy = modularity(net_ex, market_fg$market),
multilevel = modularity(net_ex,market_ml$market),
spinglass = modularity(net_ex,market_sg$market),
infomap = modularity(net_ex,market_info$market),
louvain = modularity(net_ex,market_louv$market),
eigen = modularity(net_ex,market_eigen$market),
label_prop = modularity(net_ex,market_label$market),
edge_between = modularity(net_ex,market_edge$market))
```
```{r, echo = FALSE}
df_markets_ensemble <-
cbind.data.frame(zip_code = names(market_cw$market), market_cw = market_cw$market %>% unname() %>% as.vector()) %>%
mutate(market_fg = market_fg$market[zip_code]) %>%
mutate(market_ml = market_ml$market[zip_code]) %>%
mutate(market_sg = market_sg$market[zip_code]) %>%
mutate(market_info = market_info$market[zip_code]) %>%
mutate(market_edge = market_edge$market[zip_code]) %>%
mutate(market_eigen = market_eigen$market[zip_code]) %>%
mutate(market_label = market_label$market[zip_code]) %>%
mutate(market_ensemble = ensemble_membership[zip_code])
# HSA, HRR, CZ Markets for Comparisons Later
zcta_to_hrr_hsa <- read_csv(here("public-data/shape-files/nber-hrr-hsa-pcsa/ziphsahrr2014.csv")) %>%
janitor::clean_names() %>%
rename(zip_code = zipcode ) %>%
filter( zip_code %in% zips_to_show)
market_hsa <- zcta_to_hrr_hsa %>% pull(hsanum) %>% as.numeric()
names(market_hsa) <- paste0(zcta_to_hrr_hsa$zip_code)
market_hrr <- zcta_to_hrr_hsa %>% pull(hrrnum) %>% as.numeric()
names(market_hrr) <- paste0(zcta_to_hrr_hsa$zip_code)
county_to_cz <- data.table::fread(here("public-data/shape-files/commuting-zones/counties10-zqvz0r.csv")) %>%
janitor::clean_names() %>%
rename(fips_code = fips) %>%
group_by(out10) %>%
mutate(commuting_zone_population_2010 = sum(pop10, na.rm=TRUE)) %>%
mutate(fips_code = str_pad(paste0(fips_code),width = 5, pad="0")) %>%
select(fips_code,
commuting_zone_id_2010 = out10,
commuting_zone_population_2010 ) %>%
filter(fips_code=="42101")
market_cz <- rep(county_to_cz$commuting_zone_id_2010,length(market_hsa))
names(market_cz) <- paste0(zcta_to_hrr_hsa$zip_code)
# Final Hospital Markets Data Frame
df_hosp_markets <-
df_hosp_zip %>%
left_join(df_markets_ensemble %>%
mutate(market_hrr = market_hrr[zip_code],
market_hsa = market_hsa[zip_code],
market_cz = market_cz[zip_code]) %>%
gather(type,market,-zip_code),"zip_code") %>%
group_by(type,market,prvnumgrp) %>%
summarise(total_cases = sum(total_cases)) %>%
ungroup() %>%
mutate(type2 = gsub("market_","total_cases_",type)) %>%
gather(key,value,-type,-type2,-prvnumgrp,-market) %>%
group_by(prvnumgrp) %>%
mutate(foo = row_number()) %>%
unique() %>%
#filter(row_number() %in% c(67,80))
spread(type,market) %>%
select(-key) %>%
spread(type2,value) %>%
select(-foo)
#df_hosp_markets$[is.na(df_hosp_markets)] <- 0
xy_hosp_market <-
xy_hosp %>%
left_join(df_hosp_markets,"prvnumgrp")
sf_markets <- sf_zip_excounty %>%
mutate(zip_code = zcta5ce10) %>%
left_join(df_markets_ensemble, "zip_code") %>%
mutate(market_hrr = market_hrr[zip_code],
market_hsa = market_hsa[zip_code],
market_cz = market_cz[zip_code])
```
The algorithm has identified four distinct communities. In the plot below, we map out each of these communities. Each panel of this plot shows the ZIP codes included in a detected market. The dots again correspond to the geographic location of hospitals visited by individuals from that market. The dot sizes are furthermore scaled to be proportional to patient volume / market share.
```{r,fig.cap = "Geographic Markets Identified by Cluster Walktrap Algorithm",fig.height = 8, fig.width =8,fig.pos="H", echo = FALSE}
sf_zip_excounty %>%
filter(zcta5ce10 != "19112") %>%
ggplot() + geom_sf(alpha = 0.1) +
remove_all_axes +
geom_sf(data = sf_excounty_alone, colour = "black",lwd = 1, alpha = 0.1) +
geom_text_repel(data = xy_hosp, aes(x=x,y=y,label = mname),cex = 1.5) +
theme(legend.position = "none") +
geom_sf(data = sf_markets %>% filter(zcta5ce10 != "19112") , aes(fill = "grey"),fill ="lightgrey") +
geom_point(data = xy_hosp_market %>% filter(!is.na(market_cw)), aes(x=x,y=y,size = total_cases_cw)) +
# scale_fill_distiller(palette=3) +
facet_wrap(~market_cw) +
coord_sf(datum=NA)
```
Because the cluster walktrap algorithm is hierarchical, it's useful to plot out a heatmap and dendrogram of the unipartite (ZIP-ZIP) matrix. This allows us to visualize not only the strength of hospital connections between two ZIPs, but also the specific clustering process that gives rise to the detected communities.
```{r,fig.width = 8, fig.height=8, eval = TRUE, echo = FALSE}
fit <- market_cw$fit
market.dendro <- as.dendrogram(fit)
dendro.plot <- ggdendrogram(data = market.dendro, rotate = TRUE) + theme(axis.text.y = element_text(size = 6))
df_heatmap.plot <-
up_zip_w %>% data.frame() %>% rownames_to_column() %>%
rename(zip_code = rowname) %>%
gather(zip2,share,-zip_code) %>%
as_tibble() %>%
mutate(market = membership(fit)[zip_code]) %>%
mutate(zip_code = factor(zip_code, levels = names(membership(fit))[order.dendrogram(market.dendro)])) %>%
mutate(zip2 = gsub("X","",zip2)) %>%
mutate(zip2 = factor(zip2, levels = names(membership(fit))[order.dendrogram(market.dendro)])) %>%
mutate(share = share*100)
library(ggnewscale)
heatmap.plot <-
df_heatmap.plot %>%
ggplot(aes(x = zip2,y=zip_code)) +
geom_tile(aes(fill = share)) +
scale_fill_gradient2(low="white",high="darkred") +
ggnewscale::new_scale_fill() +
geom_tile(aes(fill = as.factor(market)),alpha = 0.1) +
ggsci::scale_fill_aaas() +
theme(axis.text.y = element_text(size = 6)) +
theme(axis.text.y = element_blank(),
axis.title.y = element_blank(),
axis.ticks.y = element_blank(),
legend.position = "none") +
theme(axis.text.x=element_text(angle = -90, hjust = 0,size =6)) +
labs(fill = "",x ="ZIP Code")
#scale_colour_manual(guide=FALSE,values = c("purple","brown","darkblue","green")) +
; heatmap.plot
postscript(file = here("output/figures/dendrogram-cluster-walktrap.eps"), horiz = FALSE, onefile = FALSE, width = 8, height = 8, fonts=c("serif"))
grid::grid.newpage()
print(heatmap.plot, vp = grid::viewport(x = 0.4, y = 0.5, width = 0.8, height = .93))
print(dendro.plot, vp = grid::viewport(x = 0.9, y = 0.51, width = 0.2, height = .97))
x<- dev.off()
png(file = here("output/figures/dendrogram-cluster-walktrap.png"), width = 6*100 ,height = 6*100,res=100)
grid::grid.newpage()
print(heatmap.plot, vp = grid::viewport(x = 0.4, y = 0.5, width = 0.8, height = .93))
print(dendro.plot, vp = grid::viewport(x = 0.9, y = 0.51, width = 0.2, height = .97))
x <- dev.off()
```
![](output/figures/dendrogram-cluster-walktrap.png)
In the plot above, the rows have been subdivided (by cell line color) into the four total markets identified via community detection. That is, these are the ZIP codes corresponding to each of the four markets. The cell shadings are scaled to visualize the strength of connections between ZIPs.
When parsing this visualization it helps to keep in mind the intuition for the objective of community detection. The community detection algorithm is designed to detect densely-connected groups of nodes (in this case ZIP codes) with many connections within the groups, and fewer connections outside of the groups. The heatmap plots the strength of shared hospital "connections" between each ZIP pair, while the dendrogram plots a hierarchy of these connections starting with the most densely connected ZIPs at the bottom.
A specific example is also useful. In total, 1,182 patients from ZIP 19128 are treated at 2 hospitals, 29% at Chestnut Hill hospital (390026) and 71% at Roxborough Memorial hospital (390304). By comparison, the 1,807 patients from ZIP 19144 are treated at Chestnut Hill (21%), Einstein Medical Center (63%; ID = 390142) and Roxborough Memorial (16%).
```{r, echo = FALSE}
bp_zip_hosp_weighted %>% t() %>% data.frame() %>% rownames_to_column() %>%
filter(X19128>0|X19144>0) %>%
select(rowname,X19128,X19144) %>%
mutate_at(vars(2:3),function(x) paste0(x," (",round(100*x/sum(x),0),"%)")) %>%
kable(col.names = c("Hospital ID","ZIP 19128","ZIP 19144"))
```
From this table we can see that 100% of ZIP 19128's hospital use is in "shared" hospitals when paired with 19144, while 37% of ZIP 19144's hospital use is in "shared" hospitals when compared with 19128.
This is precisely the information portrayed in the cell shadings in the heatmap: the full red shading for row 19128, column 19144 (near the top right of the plot) indicates that 100% of 19128's patients are treated at hospitals also used by patients in 19144. By comparison, the cell for row 19144, column 19128 is shaded a lighter red, reflecting the fact that, as noted above, only 37% of 19144's patients go to hospitals also used by those in 19128.
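The arithmetic behind these two cells can be reproduced from the rounded shares quoted above. This is a sketch only: it uses the published percentages rather than the underlying patient counts, and the object names are our own. It mirrors the row normalization used to build `up_zip_w`.

```r
# Shares of each ZIP's patients by hospital, from the example above.
shares <- matrix(
  c(0.29, 0.00, 0.71,   # ZIP 19128
    0.21, 0.63, 0.16),  # ZIP 19144
  nrow = 2, byrow = TRUE,
  dimnames = list(c("19128", "19144"), c("390026", "390142", "390304"))
)

# Indicator of whether a ZIP sends any patients to a hospital.
uses <- (shares > 0) * 1

# Entry [i, j]: share of ZIP i's patients treated at hospitals ZIP j also uses.
overlap <- shares %*% t(uses)
overlap <- overlap / diag(overlap)  # divide row i by ZIP i's own total

round(100 * overlap)
```

The matrix is not symmetric: the row for 19128 gives 100 in the 19144 column, while the row for 19144 gives 37 in the 19128 column.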
With this correlation (shading) in mind, we can now focus on the dendrogram on the righthand side of the plot. For example, heatmap cells shaded bright red indicate clusters of ZIPs where patients essentially go to the same hospitals. This results in a pairing of ZIPs far down the tree diagram represented in the dendrogram. In other words, sets of ZIP codes that essentially use the same hospitals will get paired together far down the hierarchy.
With these tightly connected ZIP pairs grouped, we can then start to work up the tree, bringing in new ZIP codes that share a substantial, but not 100%, share of hospital overlap. Eventually, the four distinct markets emerge: this is apparent from the four reddish "blocks" running diagonally from southwest to northeast in the heatmap. But within each market are sub-markets of densely connected ZIPs, and the dendrogram / heatmap is designed such that we can identify these as well.
We can also see in both the network plot and the heatmap that some ZIP codes straddle markets. That is, these are ZIPs that could plausibly be classified in one market or another. In the network representation, these ZIP codes (e.g., 19140, 19134) sit in between the clusters of ZIP codes. In the heatmap, we see this in the isolated pockets of "clustering" off the diagonal (e.g., in the rows for 19141 and 19134).
## An Ensemble-Based Approach to Market Detection
The example above was shown for a single community detection algorithm (the cluster walktrap). But there are many candidate algorithms we can draw from. These community detection methods have been developed in diverse fields (physics, sociology, biology, etc.), and each method may result in a slightly different partitioning of the network. The natural question, then, is which method should we use? Or, can we improve market detection by deploying an ensemble of approaches, then identifying the final market boundaries based on a consensus among this ensemble? That is the approach we will lay out here.
### Modularity
Before we proceed, it is useful to define a measure to assess the relative performance of community detection algorithms in partitioning our network into geographic markets. **Modularity** is a widely used measure for this purpose.
[Modularity is](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1482622/) the fraction of *observed* edges that fall within defined groups in a network minus the expected fraction of edges within those groups if the edges were placed at random. Modularity ranges from -1/2 to 1, and will be positive when the number of edges within the groups exceeds the expected number based on random allocation of edges.
In most applications, the randomization of edges is done while preserving the observed **degree** of each node. Degree is a count of the total number of connections a node has. In an unweighted network, degree is simply the number of edge (connection) lines coming from the node; in a weighted network (as used here) it is the sum of the weights attached to each of these edges.
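To see how the definition works in practice, the sketch below computes modularity by hand for a small unweighted toy network and compares it with `igraph::modularity()`. The graph and the partition are hypothetical. The hand calculation sums, over all pairs of nodes in the same group, the observed adjacency minus the degree-preserving expectation `k_i * k_j / (2m)`, then divides by `2m`, where `m` is the number of edges.

```r
library(igraph)

# Toy network: two triangles joined by one bridge edge (3 -- 4).
g <- make_graph(c(1,2, 1,3, 2,3, 4,5, 4,6, 5,6, 3,4), directed = FALSE)
groups <- c(1, 1, 1, 2, 2, 2)  # hypothetical partition to score

A <- as_adjacency_matrix(g, sparse = FALSE)  # dense adjacency matrix
k <- rowSums(A)                              # degree of each node
m <- sum(A) / 2                              # total number of edges

# Observed minus expected edges, summed over pairs of nodes in the same group.
same_group <- outer(groups, groups, "==")
Q_by_hand <- sum((A - outer(k, k) / (2 * m)) * same_group) / (2 * m)

Q_by_hand
modularity(g, groups)
```

Both calls return the same value here. For a weighted network such as the one in this document, the same logic applies with edge weights in place of the 0/1 adjacency entries.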
In the case of the cluster walktrap method in our example above, we achieve a modularity score of `r round(df_modularity["walktrap"],4)`. We also ran ``r length(df_modularity)-1`` alternative community detection algorithms, and their modularity scores are summarized in the table below:
```{r, echo = FALSE}
df_modularity %>% t() %>% data.frame() %>%
rownames_to_column() %>%
mutate_at(vars(2),round,4) %>%
filter(rowname!="ensemble") %>%
set_names("Algorithm","Modularity") %>%
kable()
```
The table shows that some other community detection approaches (e.g., the Multilevel algorithm, the Spinglass algorithm, and the [Louvain modularity approach](https://en.wikipedia.org/wiki/Louvain_Modularity)) achieve higher modularity values, indicating that they have done a better job at partitioning the network.
### Frequency Matrix
We'll next construct a **frequency matrix** which summarizes the total frequency that ZIPs *i* and *j* are assigned to the same market across the ``r length(df_modularity)-1`` detection algorithms we will use in the ensemble.
\[
\mathbf{F} = [F_{ij}]
\]
This frequency matrix is visualized in the heatmap below. What is clear from this matrix is that each of the community detection methods groups ZIPs into similar markets. This is apparent from the diagonal black blocks running from southwest to northeast in the figure. These blocks are mostly shaded black--indicating that the ZIP combination was classified into the same market in all community detection approaches.
Again, as under the single detection approach considered above, we see examples of a few ZIP codes that are classified in different markets--for example, 19133 and 19132 are not consistently categorized in the same market.
```{r, echo = FALSE}
gf_F %>%
data.frame() %>%
rownames_to_column() %>%
rename(zip_code = rowname) %>% as_tibble() %>%
gather(zip2, frequency, -zip_code) %>%
mutate(zip_code = as.numeric(paste0(zip_code))) %>%
mutate(zip2 = gsub("X","",zip2)) %>%
mutate(zip2 = as.numeric(paste0(zip2))) %>%
mutate(zip_code = factor(zip_code, levels = names(membership(fit))[order.dendrogram(market.dendro)])) %>%
mutate(zip2 = factor(zip2, levels = names(membership(fit))[order.dendrogram(market.dendro)])) %>%
ggplot(aes(x = zip_code,y=zip2)) +
geom_tile(aes(fill = frequency)) +
theme(axis.text.x=element_text(angle = -90, hjust = 0,size =6),
axis.text.y= element_text(size =6)) +
scale_fill_gradient(low = "white", high = "black", breaks = c(0,3,6,9)) +
labs(fill = "Frequency", x = "ZIP Code", y= "ZIP Code")+
theme(legend.position ="right")
```
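The construction of the frequency matrix described above can be sketched on made-up data. The three membership vectors below are hypothetical stand-ins for the output of three detection algorithms on four ZIP codes.

```r
# Hypothetical market labels from three algorithms for four ZIP codes.
memberships <- list(
  algo_1 = c(zip_A = 1, zip_B = 1, zip_C = 2, zip_D = 2),
  algo_2 = c(zip_A = 1, zip_B = 1, zip_C = 1, zip_D = 2),
  algo_3 = c(zip_A = 2, zip_B = 2, zip_C = 1, zip_D = 1)
)

# For one algorithm: 1 if ZIPs i and j share a label, 0 otherwise.
co_assigned <- function(m) outer(m, m, "==") * 1

# F[i, j] = number of algorithms placing ZIPs i and j in the same market.
F_mat <- Reduce(`+`, lapply(memberships, co_assigned))
F_mat
```

Note that the label values themselves do not matter (`algo_3` swaps 1 and 2 relative to `algo_1`); only whether two ZIPs share a label within a given algorithm is counted.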
### Ensemble Clustering
We next fit an agglomerative hierarchical clustering algorithm to the frequency matrix to, in essence, find the clusterings of ZIP codes with the strongest ties across the various community detection methods.
This clustering method begins by assigning each ZIP to its own cluster (market). For the first iteration, it groups the clusters that are the most similar on some measure (in this case, that measure is the frequency at which they are assigned to the same market). This same procedure then iterates by grouping in additional ZIPs until eventually, there is just a single cluster that contains all ZIP codes. After this step, the algorithm stops.
For this example we will use the `hclust()` function default, which draws on the ["complete linkage" method](https://en.wikipedia.org/wiki/Complete-linkage_clustering). In principle, however, any clustering method could be used.
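A minimal sketch of this step on a hypothetical 4 x 4 frequency matrix follows. As in the document's code, `dist()` is applied to the rows of the matrix (Euclidean distance by default) before `hclust()` is called with its default complete linkage.

```r
# Hypothetical frequency matrix (co-assignment counts across 3 algorithms).
zips <- c("zip_A", "zip_B", "zip_C", "zip_D")
F_mat <- matrix(
  c(3, 3, 1, 0,
    3, 3, 1, 0,
    1, 1, 3, 2,
    0, 0, 2, 3),
  nrow = 4, byrow = TRUE, dimnames = list(zips, zips)
)

# Euclidean distances between rows, then complete-linkage clustering.
fit <- hclust(dist(F_mat))

# Heights at which successive merges occur.
fit$height

# Cut the tree into two clusters to obtain candidate markets.
clusters <- cutree(fit, k = 2)
clusters
```

ZIPs with identical rows in the frequency matrix (here `zip_A` and `zip_B`) are at distance zero and merge first.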
First we will plot the dendrogram produced by the clustering algorithm. Here we can see that even early on (e.g., at iteration 0), ZIP codes naturally start to sort into markets. In addition, note that at iteration 5 the "marginal" ZIPs identified above (19133 and 19132) get assigned to their own small market. Then, as we iterate forward (e.g., around iteration 20), these ZIPs get folded into the markets to which they are most similar.
```{r, echo = FALSE}
ensemble_dendro_plot + ggtitle("Dendrogram from Ensemble Clustering") + theme_tufte_revised() +
labs(x = "ZIP Code", y = "Clustering Iteration")
```
The next step is to define our final markets. We could define them based on the clusterings of ZIP codes at any point along the X axis in the dendrogram. On the extreme ends, we could use the 10 markets defined at iteration 0, or we could assign all ZIPs to the same market (i.e., ZIP clusters as defined at iteration 50).
The modularity score, again, is useful here. That is, we can work our way up the dendrogram--at each step calculating a modularity score--and use the level that maximizes modularity as our final market definitions.
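The selection rule can be sketched as follows on a toy network. Everything here is hypothetical: the graph, the three membership vectors, and the choice to scan over the number of clusters `k` (the document's own code scans over cut heights instead, which is the same idea expressed on a different scale).

```r
library(igraph)

# Toy network: two triangles joined by one bridge edge (3 -- 4).
g <- make_graph(c(1,2, 1,3, 2,3, 4,5, 4,6, 5,6, 3,4), directed = FALSE)

# Hypothetical labels from three algorithms; node 3 is a "marginal" node.
memberships <- list(c(1,1,1,2,2,2), c(1,1,1,2,2,2), c(1,1,2,2,2,2))
F_mat <- Reduce(`+`, lapply(memberships, function(m) outer(m, m, "==") * 1))
tree <- hclust(dist(F_mat))

# Score every cut of the tree, from one cluster up to one cluster per node.
n_clusters <- seq_len(vcount(g))
scores <- sapply(n_clusters, function(k) modularity(g, cutree(tree, k = k)))

# Keep the cut with the highest modularity.
best_k <- n_clusters[which.max(scores)]
best_membership <- cutree(tree, k = best_k)
best_membership
```

In this toy case the two-cluster cut has the highest modularity, so the marginal node is folded into the triangle it shares most of its edges with.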
Modularity values as a function of iteration are plotted below. For the sake of comparison, we also plot (using horizontal lines) the modularity scores from each of the individual market detection algorithms.
```{r, echo = FALSE}
data.frame(iteration = 1:length(ensemble_modularity_all), modularity = ensemble_modularity_all) %>%
filter(modularity>0) %>%
mutate(rowname = "ensemble") %>%
ggplot(aes(x = iteration, y = modularity, colour = rowname)) +
geom_line(size=1) +
theme_tufte_revised() +
ggtitle("Modularity Scores ") +
scale_y_continuous(breaks = c(0,.25,0.5,round(max(ensemble_modularity_all),4))) +
geom_hline(data = as.data.frame(df_modularity %>% t()) %>% rownames_to_column() ,
aes(yintercept = V1, colour = rowname),lty=2) +
labs(colour = "Market Detection Method",y = "Modularity Value",x = "Clustering Iteration")
```
We see that modularity is maximized at 0.592, or at about iteration 22. Coincidentally, this is also the modularity score for several of the individual algorithms--indicating that these algorithms did as good a job at identifying the markets as the ensemble method did (though there was no guarantee this would be the case).
Taking the ZIP market definitions at this iteration and plotting them, we arrive at the following **final** map of hospital markets for Philadelphia county:
```{r,fig.cap = "Geographic Markets Identified by Ensemble-Based Approach",fig.height = 8, fig.width =8,fig.pos="H", echo = FALSE}
sf_zip_excounty %>%
filter(zcta5ce10 != "19112") %>%
ggplot() + geom_sf(alpha = 0.1) +
remove_all_axes +
geom_sf(data = sf_excounty_alone, colour = "black",lwd = 1, alpha = 0.1) +
geom_text_repel(data = xy_hosp, aes(x=x,y=y,label = mname),cex = 1.5) +
theme(legend.position = "none") +
geom_sf(data = sf_markets %>% filter(zcta5ce10 != "19112") , aes(fill = "grey"),fill ="lightgrey") +
geom_point(data = xy_hosp_market %>% filter(!is.na(market_ensemble)), aes(x=x,y=y,size = total_cases_ensemble)) +
# scale_fill_distiller(palette=3) +
facet_wrap(~market_ensemble) +
coord_sf(datum=NA)
```
## Relationship Between Community Detection and Other Common Market Definition Approaches
A useful exercise is to think through how other common market definitions (HSAs, HRRs, commuting zones) compare and tie into the network analytic approach articulated above. These market definitions as applied to Philadelphia can be seen in the maps below.
```{r,fig.cap = "Geographic Markets Identified by HSA, HRR, and Commuting Zone",fig.height = 8, fig.width =8,fig.pos="H", echo = FALSE}
p_ensemble <- sf_zip_excounty %>%
filter(zcta5ce10 != "19112") %>%
ggplot() + geom_sf(alpha = 0.1) +
remove_all_axes +
geom_sf(data = sf_excounty_alone, colour = "black",lwd = 1, alpha = 0.1) +
geom_text_repel(data = xy_hosp, aes(x=x,y=y,label = mname),cex = 1.5) +
theme(legend.position = "none") +
geom_sf(data = sf_markets %>% filter(zcta5ce10 != "19112") , aes(fill = "grey"),fill ="lightgrey") +
geom_point(data = xy_hosp_market %>% filter(!is.na(market_ensemble)), aes(x=x,y=y,size = total_cases_ensemble)) +
# scale_fill_distiller(palette=3) +
facet_wrap(~market_ensemble) +
coord_sf(datum=NA) + #ggtitle("Ensemble Community Detection") +
theme(plot.title = element_text(hjust = 0.5)) +
ggtitle("Ensemble-Based Approach")
p_hsa <- sf_zip_excounty %>%
filter(zcta5ce10 != "19112") %>%
ggplot() + geom_sf(alpha = 0.1) +
remove_all_axes +
geom_sf(data = sf_excounty_alone, colour = "black",lwd = 1, alpha = 0.1) +
geom_text_repel(data = xy_hosp, aes(x=x,y=y,label = mname),cex = 1.5) +
theme(legend.position = "none") +
geom_sf(data = sf_markets %>% filter(zcta5ce10 != "19112") , aes(fill = "grey"),fill ="lightgrey") +
geom_point(data = xy_hosp_market %>% filter(!is.na(market_hsa)), aes(x=x,y=y,size = total_cases_hsa)) +
# scale_fill_distiller(palette=3) +
facet_wrap(~market_hsa) +
coord_sf(datum=NA) + ggtitle("Hospital Service Area") +
theme(plot.title = element_text(hjust = 0.5))
p_hrr <- sf_zip_excounty %>%
filter(zcta5ce10 != "19112") %>%
ggplot() + geom_sf(alpha = 0.1) +
remove_all_axes +
geom_sf(data = sf_excounty_alone, colour = "black",lwd = 1, alpha = 0.1) +
geom_text_repel(data = xy_hosp, aes(x=x,y=y,label = mname),cex = 1.5) +
theme(legend.position = "none") +
geom_sf(data = sf_markets %>% filter(zcta5ce10 != "19112") , aes(fill = "grey"),fill ="lightgrey") +
geom_point(data = xy_hosp_market %>% filter(!is.na(market_hrr)), aes(x=x,y=y,size = total_cases_hrr)) +
# scale_fill_distiller(palette=3) +
facet_wrap(~market_hrr) +
coord_sf(datum=NA) + ggtitle("Hospital Referral Region") +
theme(plot.title = element_text(hjust = 0.5))
p_cz <- sf_zip_excounty %>%
filter(zcta5ce10 != "19112") %>%
ggplot() + geom_sf(alpha = 0.1) +
remove_all_axes +
geom_sf(data = sf_excounty_alone, colour = "black",lwd = 1, alpha = 0.1) +
geom_text_repel(data = xy_hosp, aes(x=x,y=y,label = mname),cex = 1.5) +
theme(legend.position = "none") +
geom_sf(data = sf_markets %>% filter(zcta5ce10 != "19112") , aes(fill = "grey"),fill ="lightgrey") +
geom_point(data = xy_hosp_market %>% filter(!is.na(market_cz)), aes(x=x,y=y,size = total_cases_cz)) +
# scale_fill_distiller(palette=3) +
facet_wrap(~market_cz) +
coord_sf(datum=NA) + ggtitle("Commuting Zone") +
theme(plot.title = element_text(hjust = 0.5))
p_hsa + {p_hrr + p_cz} + plot_layout(ncol=1)
```
As seen in the maps, only two HSAs are represented in Philadelphia county, while only a single HRR and a single commuting zone are represented. Thus, measures of hospital concentration that draw on these definitions will include *all* hospitals in the county (plus those in neighboring areas also included in the HSA, HRR, or CZ).
### Community Detection vs. Health Service Areas (HSAs)