This scraper gathers daily new case totals for each county in the state and calculates a set of fourteen 7-day rolling averages used to visualize how each county is doing.
This scraper hits two DSHS files each day. The first is DSHS's daily feed of cases by county, found here. The second is a general DSHS configuration file that contains the last update date. If that date is later than the last update recorded in our trend file, the first file is used to update our file of 7-day averages.
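The rough shape of that flow is sketched below. The URLs, field names, and `{county: [daily counts]}` data shape are placeholders, not the project's actual values, since the DSHS endpoints and layout aren't documented here.

```python
# Sketch of the daily update: check DSHS's update date, then recompute 7-day averages.
# URLs, field names, and the {county: [daily counts]} shape are assumptions.
from datetime import date

import requests

CASES_BY_COUNTY_URL = "https://example.com/dshs-cases-by-county.json"  # placeholder
CONFIG_URL = "https://example.com/dshs-config.json"                    # placeholder


def rolling_7_day_average(daily_counts):
    """Return the 7-day rolling average for each day in a list of daily counts."""
    averages = []
    for i in range(len(daily_counts)):
        window = daily_counts[max(0, i - 6): i + 1]
        averages.append(round(sum(window) / len(window), 2))
    return averages


def run_daily_update(trend_last_updated: date) -> bool:
    """Recompute averages only if DSHS reports a newer update date than our trend file."""
    config = requests.get(CONFIG_URL, timeout=30).json()
    dshs_updated = date.fromisoformat(config["last_update"])  # assumed field name

    if dshs_updated <= trend_last_updated:
        return False  # nothing new from DSHS today

    cases = requests.get(CASES_BY_COUNTY_URL, timeout=30).json()
    trend = {county: rolling_7_day_average(counts) for county, counts in cases.items()}
    # ...write `trend` back into the trend file and upload it to S3 (see utils.py)...
    return True
```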
- `scraper.py`: Set of functions to complete the daily update and to repair files should a daily update be missed.
- `service.py`: File run on AWS Lambda. Runs the daily update and uploads the resulting data file back to AWS S3.
- `utils.py`: Simple function that handles the uploading of files to S3 (a rough sketch of such a helper follows this list).
- `zappa_settings.json`: Zappa configuration file containing the project name, description, runtime environment, and, most importantly, the schedule for the scraper to run.
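`utils.py` itself isn't reproduced here; a minimal sketch of that kind of upload helper, built on boto3 with hypothetical bucket and key names, looks like this:

```python
# Minimal sketch of an S3 upload helper (assumed names, not the project's exact code).
import boto3


def upload_to_s3(local_path: str, bucket: str, key: str) -> None:
    """Upload a local file to s3://{bucket}/{key} with a JSON content type."""
    s3 = boto3.client("s3")
    s3.upload_file(
        Filename=local_path,
        Bucket=bucket,
        Key=key,
        ExtraArgs={"ContentType": "application/json"},
    )


# Hypothetical usage:
# upload_to_s3("trend_data.json", "my-root-bucket", "covid/trends/trend_data.json")
```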
Download the repository and run `$ pipenv install --dev`.
Copy the `.env-example` file and rename it `.env`. Add your own `AWS_ACCESS_KEY` and `AWS_SECRET_ACCESS_KEY`. For messaging to Slack, you'll also need our `SLACK_TOKEN`.
Other environment variables:
- `TREND_DATA_FILE`: URL path of your current JSON data file. Results of the scraper are output to this URL as well.
- `REPAIR_FILE`: URL path of a copy of the current JSON file. This file is edited to backfill missing data and is used to repair averages for missing periods of time.
- `TARGET_BUCKET`: Subdirectory path to the bucket where the JSON file lives. Combined with `ROOT_BUCKET` to form the complete file path of the JSON data file.
- `ROOT_BUCKET`: Root AWS bucket where data is stored. Combined with `TARGET_BUCKET` to construct the complete AWS file path to the JSON file (see the sketch below for how the two are combined).
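A rough sketch of how these variables might be read and combined at runtime; the exact joining logic and file name are assumptions:

```python
# Sketch only: reads the environment variables above and builds the full S3 path.
import os

TREND_DATA_FILE = os.getenv("TREND_DATA_FILE")  # e.g. "https://.../trend_data.json"
REPAIR_FILE = os.getenv("REPAIR_FILE")
ROOT_BUCKET = os.getenv("ROOT_BUCKET")          # e.g. "my-data-bucket"
TARGET_BUCKET = os.getenv("TARGET_BUCKET")      # e.g. "covid/trends"

# ROOT_BUCKET and TARGET_BUCKET combine into the complete S3 path of the JSON file.
# The file name here is a placeholder.
full_s3_key = f"{TARGET_BUCKET}/trend_data.json"
full_s3_path = f"{ROOT_BUCKET}/{full_s3_key}"
```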
This project uses Zappa to upload and schedule the scraper on our AWS Lambda. After making changes to the scraper, run `pipenv run zappa update` to push those changes to Lambda. Scheduling is handled via the `zappa_settings.json` file: the `events` key is an array of event objects, and the event object pointing at `service.handler` has an `expression` key that takes either a cron-format schedule or a rate (e.g. `rate(12 hours)`), as in the example below.
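A minimal example of that structure, using Zappa's documented events schema; the project name, runtime, and deployment bucket shown here are placeholders:

```json
{
    "production": {
        "project_name": "county-trends-scraper",
        "runtime": "python3.8",
        "s3_bucket": "my-zappa-deployments",
        "events": [
            {
                "function": "service.handler",
                "expression": "rate(12 hours)"
            }
        ]
    }
}
```

A cron-style schedule, such as `"expression": "cron(0 12 * * ? *)"`, can be used in place of the rate.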
To run the scraper locally, run the following from the command line:
$ pipenv run python service.py
Note that this simulates a scheduled scraper run, so any files it generates will be uploaded to S3. To run or test just the scraper itself locally, run:
$ pipenv run python scraper.py