Code for MEDSAT: A Public Health Dataset for England Featuring Medical Prescriptions and Satellite Imagery.
👩‍⚕️ 🏥 🌲 🏡 💊 💉 🧑‍💼 👨‍👩‍👧‍👦 👶 👵
The MEDSAT dataset serves as a comprehensive resource for public and population health studies in England, encompassing medical prescription quantities per capita as outcomes and a wide array of sociodemographic and environmental variables as features. In this release, we provide data snapshots for the years 2019 (pre-COVID) and 2020 (COVID). The sociodemographic variables align with the latest UK census, from 2021.
The data is hosted on TUMMedia (the data-sharing service of the Technical University of Munich):
https://doi.org/10.14459/2023mp1714817
The dataset is released under the CC BY-SA 4.0 license.
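
For example, a yearly master file can be loaded directly with geopandas. A minimal sketch, assuming the 2020 master file has been downloaded into data/point_data/:

```python
import geopandas as gpd

# Load the 2020 MEDSAT master file: one row per LSOA,
# with outcomes and features as columns.
medsat = gpd.read_file("data/point_data/2020_spatial_raw_master.geojson")
print(medsat.shape)
```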
collate_data
-- contains the code for producing the four data components from different sources and merging them into a single dataset, MEDSAT.

  data_master.ipynb
  -- collates the four components into single master data files for each year, producing both .csv and .geojson master files as output (a minimal, hypothetical sketch of this merge follows the directory overview below).

environmental_data_extractor
-- produces the environmental point features and is the only module that requires signing up for an external service, Google Earth Engine (GEE). This is because we collate, process, and analyse enormous amounts of satellite products to calculate the yearly environmental point features. Instructions for signing up and running the extractor code are in the module's README file, and we also provide slides with screenshots to help with the process.

image_features_extractor
-- provides the code for extracting area-level (LSOA) features from the Sentinel-2 composite images.

NHS_prescription_parser
-- extracts the outcomes for MEDSAT from NHS prescription data.

sociodemographic_data_parser
-- this notebook extracts sociodemographic features per LSOA from the raw files downloaded from the UK ONS website.
models_and_xai
-- contains the code for predicting and explaining health outcomes from the features.
auxiliary_data
-- holds spatial data, i.e., the LSOA and Region shapefiles.

point_data
-- contains MEDSAT and its components.

  {year}_spatial_raw_master.csv and {year}_spatial_raw_master.geojson
  -- the MEDSAT master files.

  data_sources
  -- contains the raw input downloaded from the UK Census 2021.

  image_features
  -- contains the features extracted from each image composite band across LSOAs for two seasons, winter (DJF) and summer (JJA), for the year 2020.

  data_components
  -- where the results from each module parsing the different data sources are placed before being merged into the yearly master files.
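
To make the collation step concrete, here is a minimal, hypothetical sketch of what data_master.ipynb does: join the yearly components on the LSOA code and write the master file in both formats. The file names, the "geography code" join key, and the shapefile path are assumptions; see the notebook for the actual logic.

```python
import pandas as pd
import geopandas as gpd

YEAR = 2020
base = "data/point_data/data_components"

# Yearly components produced by the modules above (file names assumed).
outcomes = pd.read_csv(f"{base}/{YEAR}_outcomes.csv")
environment = pd.read_csv(f"{base}/{YEAR}_environment.csv")
controls = pd.read_csv(f"{base}/controls.csv")

# Join on the LSOA code; the image features component is merged the same way.
master = outcomes.merge(environment, on="geography code", how="inner")
master = master.merge(controls, on="geography code", how="inner")
master.to_csv(f"data/point_data/{YEAR}_spatial_raw_master.csv", index=False)

# Attach LSOA geometries from auxiliary_data (shapefile name is a placeholder).
lsoas = gpd.read_file("data/auxiliary_data/LSOA_shapefile.shp")
master_gdf = lsoas.merge(master, on="geography code", how="inner")
master_gdf.to_file(f"data/point_data/{YEAR}_spatial_raw_master.geojson", driver="GeoJSON")
```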
STEPS
1. Download the data from TUMMedia. The structure from the server is mirrored here, which means that you can place the content of auxiliary_data from the server into auxiliary_data in the repository, and the same for the other data folders.
2. Run the code in models_and_xai to analyse the MEDSAT dataset.
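
   To give a flavour of what the notebooks in models_and_xai do, here is a minimal, hypothetical sketch that fits a gradient-boosting model on one prescription outcome and explains it with SHAP. The outcome column name and the model choice are illustrative assumptions, not the exact setup used in the notebooks.

   ```python
   import geopandas as gpd
   import shap
   from lightgbm import LGBMRegressor
   from sklearn.model_selection import train_test_split

   medsat = gpd.read_file("data/point_data/2020_spatial_raw_master.geojson")

   # Hypothetical outcome column; all remaining numeric columns serve as features.
   y = medsat["opioids_quantity_per_capita"]
   X = medsat.drop(columns=["opioids_quantity_per_capita", "geometry"]).select_dtypes("number")

   X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
   model = LGBMRegressor(n_estimators=500).fit(X_train, y_train)
   print("R^2 on held-out LSOAs:", model.score(X_test, y_test))

   # Explain the predictions with SHAP values.
   explainer = shap.TreeExplainer(model)
   shap_values = explainer.shap_values(X_test)
   shap.summary_plot(shap_values, X_test)
   ```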
3. If you want to re-create any data component of MEDSAT (e.g., the environmental point features or the sociodemographic features), you can cd into the module folder and run the respective module for generating it (1-4 below). Each module has its own README file inside specifying how to use the code within it. We also provide a specific conda environment .yml specification for each module (or, in the case of environmental_data_extractor, a guide for how to set up GEE in the Google Colab environment).
   1. Run the Jupyter notebooks from environmental_data_extractor to obtain the environmental point features. Since this runs in the Google Colab environment, the results get saved to your Google Drive, from where you can download them and place them into data/point_data/data_components/{year}_environment.csv.
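
      A minimal sketch of the kind of Earth Engine query such a notebook runs (the satellite product, band, reducer, scale, and the LSOA asset path are illustrative assumptions, not the module's exact configuration):

      ```python
      import ee

      ee.Authenticate()  # one-off browser sign-in to your GEE account
      ee.Initialize()

      # LSOA polygons previously uploaded as a GEE asset (path is hypothetical).
      lsoas = ee.FeatureCollection("users/<your_account>/england_lsoas")

      # Example environmental variable: yearly mean NO2 from Sentinel-5P.
      no2_2020 = (
          ee.ImageCollection("COPERNICUS/S5P/OFFL/L3_NO2")
          .select("tropospheric_NO2_column_number_density")
          .filterDate("2020-01-01", "2021-01-01")
          .mean()
      )

      # Reduce the image to one mean value per LSOA and export the table to Drive.
      per_lsoa = no2_2020.reduceRegions(collection=lsoas, reducer=ee.Reducer.mean(), scale=1113)
      task = ee.batch.Export.table.toDrive(per_lsoa, description="2020_environment", fileFormat="CSV")
      task.start()
      ```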
   2. Run NHS_prescription_parser to obtain the prescription outcomes for the selected conditions. The resulting outputs get placed into data/point_data/data_components/{year}_outcomes.csv.
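
      For intuition, a hypothetical sketch of the kind of aggregation this module performs on the open NHS prescribing data (the file name, column names, and BNF code prefix are assumptions):

      ```python
      import pandas as pd

      # One monthly English Prescribing Dataset file (name assumed).
      rx = pd.read_csv("EPD_202001.csv")

      # Keep one condition via its BNF code prefix (here: antidepressants, BNF section 4.3)
      # and sum the prescribed quantity per GP practice.
      antidepressants = rx[rx["BNF_CODE"].astype(str).str.startswith("0403")]
      per_practice = antidepressants.groupby("PRACTICE_CODE")["TOTAL_QUANTITY"].sum()

      # The module then maps practice-level totals to LSOAs and normalises them per capita.
      print(per_practice.head())
      ```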
   3. WasdiAverageComposite runs on WASDI servers to obtain the environmental image features, i.e., 37 x 4 seasonal Sentinel-2 composite images (totalling ~120 GB per season, i.e., ~600 GB per year). This data is saved on the TUM server (at the data address provided at the beginning of this file). You can download (a part of) this data into data/image_data/ and continue parsing it with the code provided here.
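
      A minimal sketch of how LSOA-level statistics can be pulled from one composite band with rasterstats (both file names are placeholders; image_features_extractor documents the actual procedure):

      ```python
      import geopandas as gpd
      from rasterstats import zonal_stats

      # LSOA polygons from auxiliary_data and one Sentinel-2 composite band.
      lsoas = gpd.read_file("data/auxiliary_data/LSOA_shapefile.shp")
      stats = zonal_stats(lsoas, "data/image_data/2020_JJA_B04.tif", stats=["mean", "std"])

      # One summary value per LSOA polygon for this band.
      lsoas["B04_mean"] = [s["mean"] for s in stats]
      ```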
   4. Run sociodemographic_data_parser to obtain the sociodemographic features from the UK census. They get placed into data/point_data/data_components/controls.csv.
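
      A hypothetical sketch of the per-LSOA extraction from one raw ONS census table (the file name and column names are assumptions):

      ```python
      import pandas as pd

      # A raw Census 2021 bulk table downloaded from the ONS website (name assumed).
      age = pd.read_csv("census2021-ts007-lsoa.csv")

      # Keep the LSOA identifier plus the age-band counts,
      # and convert the counts to shares of each LSOA's population.
      counts = age.set_index("geography code").filter(like="Age:")
      shares = counts.div(counts.sum(axis=1), axis=0)
      shares.to_csv("data/point_data/data_components/controls.csv")
      ```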
4. Run collate_data to pull the four extracted data components into a single master file per year, saved into data/point_data/{year}_spatial_raw_master both as .csv and as .geojson.