CDC_imports_AutoRefresh
SudhishaK committed Jan 20, 2025
1 parent 40800cc commit 19676f8
Showing 23 changed files with 560 additions and 173 deletions.
@@ -4,7 +4,7 @@ typeOf: dcs:StatVarObservation
observationDate: C:OzoneCTPollution->date
variableMeasured: dcs:Mean_Concentration_AirPollutant_Ozone
observationPeriod: "P8H"
unit: parts per billion (ppb)
unit: PartsPerBillion
value: C:OzoneCTPollution->Value

Node: E:OzoneCTPollution->E2
@@ -6,34 +6,34 @@ Node: E:OzoneCountyPollution->E1
observationAbout: C:OzoneCountyPollution->dcid
typeOf: dcs:StatVarObservation
observationDate: C:OzoneCountyPollution->date
value: C:OzoneCountyPollution->O3_mean_pred
value: C:OzoneCountyPollution->o3_mean_pred
observationPeriod: "P8H"
unit: parts per billion (ppb)
unit: PartsPerBillion
variableMeasured: dcs:Mean_Concentration_AirPollutant_Ozone

Node: E:OzoneCountyPollution->E2
observationAbout: C:OzoneCountyPollution->dcid
typeOf: dcs:StatVarObservation
observationDate: C:OzoneCountyPollution->date
value: C:OzoneCountyPollution->O3_med_pred
value: C:OzoneCountyPollution->o3_med_pred
observationPeriod: "P8H"
unit: parts per billion (ppb)
unit: PartsPerBillion
variableMeasured: dcs:Median_Concentration_AirPollutant_Ozone

Node: E:OzoneCountyPollution->E3
observationAbout: C:OzoneCountyPollution->dcid
typeOf: dcs:StatVarObservation
observationDate: C:OzoneCountyPollution->date
value: C:OzoneCountyPollution->O3_max_pred
value: C:OzoneCountyPollution->o3_max_pred
observationPeriod: "P8H"
unit: parts per billion (ppb)
unit: PartsPerBillion
variableMeasured: dcs:Max_Concentration_AirPollutant_Ozone

Node: E:OzoneCountyPollution->E4
observationAbout: C:OzoneCountyPollution->dcid
typeOf: dcs:StatVarObservation
observationDate: C:OzoneCountyPollution->date
value: C:OzoneCountyPollution->O3_pop_pred
value: C:OzoneCountyPollution->o3_pop_pred
observationPeriod: "P8H"
unit: parts per billion (ppb)
unit: PartsPerBillion
variableMeasured: dcs:PopulationWeighted_Concentration_AirPollutant_Ozone
@@ -2,15 +2,15 @@ Node: E:PM25CTPollution->E1
observationAbout: C:PM25CTPollution->dcid
typeOf: dcs:StatVarObservation
observationDate: C:PM25CTPollution->date
variableMeasured: dcs:Mean_Concentration_AirPollutant_PM2.5
variableMeasured: C:PM25CTPollution->StatisticalVariable
observationPeriod: "P24H"
unit: μg/m3
unit: dcs:MicrogramsPerCubicMeter
value: C:PM25CTPollution->Value

Node: E:PM25CTPollution->E2
observationAbout: C:PM25CTPollution->dcid
typeOf: dcs:StatVarObservation
observationDate: C:PM25CTPollution->date
variableMeasured: dcs:Mean_Concentration_AirPollutant_PM2.5_StandardError
variableMeasured: C:PM25CTPollution->StatisticalVariable
observationPeriod: "P24H"
value: C:PM25CTPollution->Error
value: C:PM25CTPollution->Value
@@ -2,34 +2,34 @@ Node: E:PM25CountyPollution->E1
observationAbout: C:PM25CountyPollution->dcid
typeOf: dcs:StatVarObservation
observationDate: C:PM25CountyPollution->date
value: C:PM25CountyPollution->PM25_mean_pred
value: C:PM25CountyPollution->pm25_mean_pred
observationPeriod: "P24H"
unit: μg/m3
unit: MicrogramsPerCubicMeter
variableMeasured: dcs:Mean_Concentration_AirPollutant_PM2.5

Node: E:PM25CountyPollution->E2
observationAbout: C:PM25CountyPollution->dcid
typeOf: dcs:StatVarObservation
observationDate: C:PM25CountyPollution->date
value: C:PM25CountyPollution->PM25_med_pred
value: C:PM25CountyPollution->pm25_med_pred
observationPeriod: "P24H"
unit: μg/m3
unit: MicrogramsPerCubicMeter
variableMeasured: dcs:Median_Concentration_AirPollutant_PM2.5

Node: E:PM25CountyPollution->E3
observationAbout: C:PM25CountyPollution->dcid
typeOf: dcs:StatVarObservation
observationDate: C:PM25CountyPollution->date
value: C:PM25CountyPollution->PM25_max_pred
value: C:PM25CountyPollution->pm25_max_pred
observationPeriod: "P24H"
unit: μg/m3
unit: MicrogramsPerCubicMeter
variableMeasured: dcs:Max_Concentration_AirPollutant_PM2.5

Node: E:PM25CountyPollution->E4
observationAbout: C:PM25CountyPollution->dcid
typeOf: dcs:StatVarObservation
observationDate: C:PM25CountyPollution->date
value: C:PM25CountyPollution->PM25_pop_pred
value: C:PM25CountyPollution->pm25_pop_pred
observationPeriod: "P24H"
unit: μg/m3
unit: MicrogramsPerCubicMeter
variableMeasured: dcs:PopulationWeighted_Concentration_AirPollutant_PM2.5
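
The `value:` lines above switch the column references to lower case, presumably to match the headers of the cleaned CSV. A minimal sketch of a check for that kind of mismatch follows, assuming column names are compared case-sensitively. `missing_columns` is a hypothetical helper, not part of this commit, and the sample header is made up for illustration.

```python
import csv
import io
import re

# Matches TMCF column references such as "C:PM25CountyPollution->pm25_mean_pred".
_COLUMN_REF = re.compile(r"C:(\w+)->(\w+)")


def missing_columns(tmcf_text, csv_header):
    """Return the TMCF column references that are absent from the CSV header."""
    header = set(csv_header)
    return sorted({
        column for _table, column in _COLUMN_REF.findall(tmcf_text)
        if column not in header
    })


tmcf = """
Node: E:PM25CountyPollution->E1
observationAbout: C:PM25CountyPollution->dcid
observationDate: C:PM25CountyPollution->date
value: C:PM25CountyPollution->pm25_mean_pred
"""

# Illustrative header that still uses the old upper-case column name.
header = next(csv.reader(io.StringIO("dcid,date,PM25_mean_pred\n")))
print(missing_columns(tmcf, header))
```

With the old upper-case header the lower-case reference is reported as missing; once the header matches the TMCF, the list is empty.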
70 changes: 57 additions & 13 deletions scripts/us_cdc/environmental_health_toxicology/README.md
@@ -85,6 +85,62 @@ These data were collected as part of the [CDC National Environment Public Health

### Import Procedure

#### Processing Steps

For each import, download the source files and then process them:

1. Import name: CDC_PM25CensusTract

To download the air quality data files, run:
```bash
$ python3 download_files.py CDC_PM25CensusTract
```
To process the downloaded files, run:
```bash
$ python3 scripts/us_cdc/environmental_health_toxicology/parse_air_quality.py CDC_PM25CensusTract
```
2. Import name: CDC_OzoneCensusTract

To download the air quality data files, run:
```bash
$ python3 download_files.py CDC_OzoneCensusTract
```
To process the downloaded files, run:
```bash
$ python3 scripts/us_cdc/environmental_health_toxicology/parse_air_quality.py CDC_OzoneCensusTract
```
3. Import name: CDC_PM25County

To download the air quality data files, run:
```bash
$ python3 download_files.py CDC_PM25County
```
To process the downloaded files, run:
```bash
$ python3 scripts/us_cdc/environmental_health_toxicology/parse_air_quality.py CDC_PM25County
```
4. Import name: CDC_OzoneCounty

To download the air quality data files, run:
```bash
$ python3 download_files.py CDC_OzoneCounty
```
To process the downloaded files, run:
```bash
$ python3 scripts/us_cdc/environmental_health_toxicology/parse_air_quality.py CDC_OzoneCounty
```
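
The four imports follow the same download-then-process pattern, so they can be driven from one loop. The sketch below is one way to do that and is not part of this commit: `build_commands` and `run_all` are hypothetical helpers, and it assumes both scripts are invoked from the `scripts/us_cdc/environmental_health_toxicology` directory.

```python
import subprocess

# Import names taken from the steps above.
IMPORT_NAMES = [
    "CDC_PM25CensusTract",
    "CDC_OzoneCensusTract",
    "CDC_PM25County",
    "CDC_OzoneCounty",
]


def build_commands(import_name):
    """Return the download and process commands for one import, in order."""
    return [
        ["python3", "download_files.py", import_name],
        ["python3", "parse_air_quality.py", import_name],
    ]


def run_all(dry_run=True):
    """Print every command, or execute it when dry_run is False."""
    for name in IMPORT_NAMES:
        for command in build_commands(name):
            if dry_run:
                print(" ".join(command))
            else:
                # check=True stops at the first failing step.
                subprocess.run(command, check=True)


if __name__ == "__main__":
    run_all(dry_run=True)
```

The dry run only prints the eight commands; pass `dry_run=False` to execute them.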
#### Note

- The `import_configs.json` file is uploaded to GCP and holds the import configuration, such as source URLs and input and output filenames.
  GCP location: `unresolved_mcf/cdc/environmental/import_configs.json`
- The `download_files.py` and `parse_air_quality.py` scripts read this config file to download and process the files, respectively.
- URLs for future data releases should be added to this config file so that the new data is processed.
- Downloaded files are saved in the `input_files` directory.
- Output files are written to the `output` directory.
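
A minimal sketch of how an entry in this config might be looked up. The keys `import_name`, `files`, `url`, and `input_file_name` are the ones `download_files.py` reads; the sample URL and file name are placeholders, any other keys in the real file are not shown, and `files_for_import` is a hypothetical helper rather than part of the scripts.

```python
import json

# Placeholder contents; the real config lives in GCS and its URLs are not reproduced here.
SAMPLE_CONFIG = json.loads("""
[
  {
    "import_name": "CDC_PM25County",
    "files": [
      {
        "url": "https://example.com/pm25_county.csv",
        "input_file_name": "pm25_county.csv"
      }
    ]
  }
]
""")


def files_for_import(config, import_name):
    """Return the list of file entries configured for one import name."""
    for entry in config:
        if entry["import_name"] == import_name:
            return entry["files"]
    return []


for file_info in files_for_import(SAMPLE_CONFIG, "CDC_PM25County"):
    print(file_info["url"], "->", file_info["input_file_name"])
```

Adding a future data release would then mean appending another object to the `files` list of the matching import.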
#### Testing

##### Test Air Quality Data Cleaning Script
@@ -94,9 +150,6 @@ To test the air quality data cleaning script, run:
```bash
$ python3 parse_air_quality_test.py
```

The expected output of this test can be found in [`small_Ozone_County_expected.csv`](https://github.com/datacommonsorg/data/blob/master/scripts/us_cdc/environmental_health_toxicology/test_data/small_Ozone_County_expected.csv).

##### Test Precipitation Index Data Cleaning Script

To test the precipitation index data cleaning script, run:
@@ -109,18 +162,9 @@ The expected output of this test can be found in [`small_Palmer_expected.csv`](h

#### Processing Steps

`@input_file_name` - path to the input csv file to be cleaned

`@output_file_name` - path to write the cleaned csv file

To clean the air quality data files, run:

```bash
$ python3 parse_air_quality.py input_file_name output_file_name
```

To clean the precipitation index data files, run:

```bash
$ python3 parse_precipitation_index.py input_file_name output_file_name
```

95 changes: 95 additions & 0 deletions scripts/us_cdc/environmental_health_toxicology/download_files.py
@@ -0,0 +1,95 @@
# Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import json, os, requests, sys
from pathlib import Path
from absl import app, logging, flags
from retry import retry

_FLAGS = flags.FLAGS
flags.DEFINE_string('input_file_path', 'input_files', 'Input files path')
flags.DEFINE_string(
    'config_file', 'gs://unresolved_mcf/cdc/environmental/import_configs.json',
    'Config file path')

_MODULE_DIR = os.path.dirname(os.path.abspath(__file__))
_INPUT_FILE_PATH = None
sys.path.append(os.path.join(_MODULE_DIR, '../../../util/'))
import file_util

# Query suffix that asks the source API for the total number of records.
record_count_query = '?$query=select%20count(*)%20as%20COLUMN_ALIAS_GUARD__count'


def download_files(importname, config):

    @retry(tries=3, delay=2, backoff=2)
    def download_with_retry(url, input_file_name):
        logging.info(f"Downloading file from URL: {url}")
        response = requests.get(url)
        response.raise_for_status()
        if response.status_code == 200:
            if not response.content:
                logging.fatal(
                    f"No data available for URL: {url}. Aborting download.")
                return
            filename = os.path.join(_INPUT_FILE_PATH, input_file_name)
            with file_util.FileIO(filename, 'wb') as f:
                f.write(response.content)
        else:
            logging.error(
                f"Failed to download file from URL: {url}. Status code: {response.status_code}"
            )

    # Defined up front so the except block can always reference it.
    url_new = None
    try:
        for config1 in config:
            if config1["import_name"] == importname:
                files = config1["files"]
                for file_info in files:
                    url_new = file_info["url"]
                    logging.info(f"URL from config file {url_new}")
                    input_file_name = file_info["input_file_name"]
                    logging.info(f"Input File Name {input_file_name}")

                    # Ask for the record count so the whole dataset can be
                    # requested in a single download.
                    get_record_count = requests.get(
                        url_new.replace('.csv', record_count_query))
                    if get_record_count.status_code == 200:
                        record_count = json.loads(
                            get_record_count.text
                        )[0]['COLUMN_ALIAS_GUARD__count']
                        logging.info(
                            f"Numbers of records found for the URL {url_new} is {record_count}"
                        )
                        url_new = f"{url_new}?$limit={record_count}&$offset=0"
                    download_with_retry(url_new, input_file_name)

    except Exception as e:
        logging.fatal(f"Error downloading URL {url_new} - {e}")


def main(_):
    """Main function to download the csv files."""
    global _INPUT_FILE_PATH
    _INPUT_FILE_PATH = os.path.join(_MODULE_DIR, _FLAGS.input_file_path)
    Path(_INPUT_FILE_PATH).mkdir(parents=True, exist_ok=True)
    # The import name is the first positional command-line argument.
    importname = sys.argv[1]
    logging.info(f'Loading config: {_FLAGS.config_file}')
    with file_util.FileIO(_FLAGS.config_file, 'r') as f:
        config = json.load(f)
    download_files(importname, config)
    logging.info("Successfully downloaded the source data...!!!!")


if __name__ == "__main__":
    app.run(main)
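
The URL handling in `download_files` can be sketched in isolation, which makes it easier to see what is requested from the source. The two helper names below are hypothetical and the example URL is a placeholder; only the query strings are taken from the script.

```python
RECORD_COUNT_QUERY = "?$query=select%20count(*)%20as%20COLUMN_ALIAS_GUARD__count"


def count_url(csv_url):
    """Build the URL that returns the record count as JSON.

    Mirrors the script: every '.csv' in the URL is replaced by the count query.
    """
    return csv_url.replace(".csv", RECORD_COUNT_QUERY)


def full_download_url(csv_url, record_count):
    """Build the URL that requests all records in a single response."""
    return f"{csv_url}?$limit={record_count}&$offset=0"


# Placeholder dataset URL, used only for illustration.
example = "https://example.com/resource/abcd-1234.csv"
print(count_url(example))
print(full_download_url(example, 1500))
```

Setting `$limit` to the reported record count is what lets the script fetch the whole dataset in one request instead of paging through it.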