
Convert Google Symptoms pipeline to pull data from BigQuery #699


Merged
krivard merged 48 commits into main from gs-pull-from-bigquery on Feb 10, 2021

Conversation

nmdefries (Contributor) commented Jan 15, 2021

Description

Switch from pulling Google Symptoms data from GitHub (now deprecated) to pulling directly from the relevant tables in BigQuery.

To reduce BigQuery usage, this pulls only the required columns (open_covid_region_code, date, and the symptom columns), and only those dates between the export start date and the current date that do not already appear in any file names in the receiving directory. An additional 6 days prior to the earliest missing date are pulled to support calculating smoothed indicators.
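
For concreteness, a minimal sketch of that date-selection logic. The helper names, the file-name convention in the receiving directory, and the use of pandas are assumptions for illustration, not the actual code in pull.py:

```python
# Illustrative sketch only; names and file-name conventions are hypothetical.
from datetime import date, timedelta
from pathlib import Path

import pandas as pd

PAD_DAYS = 6  # extra preceding days pulled so smoothed indicators can be computed


def get_missing_dates(receiving_dir: str, export_start_date: date) -> list:
    """Return dates between export_start_date and today with no exported file."""
    candidates = pd.date_range(export_start_date, date.today()).date
    # Assumes export file names begin with the date as YYYYMMDD.
    existing = {fname.name[:8] for fname in Path(receiving_dir).glob("*.csv")}
    return [d for d in candidates if d.strftime("%Y%m%d") not in existing]


def get_dates_to_pull(missing_dates: list) -> list:
    """Prepend the 6 days before the earliest missing date, for smoothing."""
    if not missing_dates:
        return []
    start = min(missing_dates) - timedelta(days=PAD_DAYS)
    return list(pd.date_range(start, max(missing_dates)).date)
```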

Changelog

  • Modifies pull.py::pull_gs_data() to act as a high-level function
  • Creates supporting functions in pull.py for pulling a single geo type at a time, getting the list of dates to retrieve, and formatting the query string (see the sketch after this list)
  • Adds tests for these functions
  • Adds new field to params.json to support BigQuery API credentials
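
For illustration, a sketch of what the query-string formatting might look like; the table path, the symptom column names, and the date representation are assumptions, not copied from the PR:

```python
# Illustrative sketch only: the actual column list and table path differ.
def format_query(table: str, dates: list) -> str:
    """Build a query that pulls only the required columns for the given dates."""
    symptom_cols = ["symptom_Anosmia", "symptom_Ageusia"]  # example subset
    cols = ", ".join(["open_covid_region_code", "date"] + symptom_cols)
    date_list = ", ".join(f"'{d.strftime('%Y-%m-%d')}'" for d in dates)
    return f"SELECT {cols} FROM `{table}` WHERE date IN ({date_list})"

# Hypothetical usage:
# query = format_query("<project>.<dataset>.<state_table>", dates_to_pull)
```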

jingjtang (Contributor) commented Jan 28, 2021

small linting errors

************* Module delphi_google_symptoms.run
delphi_google_symptoms/run.py:13:16: C0303: Trailing whitespace (trailing-whitespace)
delphi_google_symptoms/run.py:14:22: C0303: Trailing whitespace (trailing-whitespace)
delphi_google_symptoms/run.py:74:0: C0303: Trailing whitespace (trailing-whitespace)
delphi_google_symptoms/run.py:11:0: C0411: standard import "import time" should be placed before "import numpy as np" (wrong-import-order)
************* Module delphi_google_symptoms.pull
delphi_google_symptoms/pull.py:5:0: W0611: Unused date imported from datetime (unused-import)

jingjtang (Contributor)

@krivard Do we have the smoothing utils merged? According to #306, it seems not. If they are, we can switch directly to the smoothing utils.

chinandrew (Contributor)

> @krivard Do we have the smoothing utils merged? According to #306, it seems not. If they are, we can switch directly to the smoothing utils.

It's merged and available in delphi_utils.smooth

https://github.com/cmu-delphi/covidcast-indicators/blob/main/_delphi_utils_python/delphi_utils/smooth.py

krivard (Contributor) commented Jan 29, 2021

> small linting errors

@chinandrew any idea why this would be failing linting when run locally, but passing in CI?

krivard (Contributor) commented Jan 29, 2021

> @krivard Do we have the smoothing utils merged?

Not relevant to this PR -- this is just for switching to BigQuery. We'll handle switching to the smoother in a separate effort before reactivating this indicator in production.

(PRs are meant to be lightweight and easy to review; piling everything into a single PR makes it harder to tell if something is wrong)

chinandrew (Contributor)

> small linting errors

> @chinandrew any idea why this would be failing linting when run locally, but passing in CI?

That's interesting... they're failing on CI too, but the check is passing. The exit code is 0 for me, which is odd. Pylint hasn't released a new version recently either, so I'm a bit confused why that's happening.

chinandrew (Contributor) commented Jan 29, 2021

I think I figured it out. make lint runs two lint commands strung together with a semicolon, so if the first fails but the second passes, the exit code corresponds to the last command and is 0. I'll work on a fix.
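
A minimal sketch (not the actual Makefile or CI config) of the exit-code behavior being described: the shell's exit status for semicolon-joined commands is that of the last command only, so a failing first linter is masked by a passing second one.

```python
# `false` stands in for a failing pylint run, `true` for a passing one.
import subprocess

print(subprocess.run("false; true", shell=True).returncode)   # 0 -> CI sees success
print(subprocess.run("false && true", shell=True).returncode)  # 1 -> first failure propagates
```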

The old set of tables, one per year and per country/state/county level, has
been removed. The new set of tables, one per country/state/county level,
will have static names and will be continually updated with new dates.

Function dependence on table year was removed. Logic to handle a
table-not-found error was removed -- this was originally meant to print
a message and continue execution if a year-table had not yet been
created. Since the new tables should always exist, a table-not-found
error should stop execution.

Since the BigQuery tables are currently not partitioned by date, each
query processes (and bills for) all rows in the table regardless of the
filters applied (date and country, at the moment). Given that, this
pipeline pulls all dates from the specified start date to the current
date by default, since a narrower filter would not reduce cost. This
setting should be updated to a shorter date window (~14 days seems
reasonable) if/when the tables are converted to "partitioned" format.
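
As an aside (not part of the PR), a BigQuery dry run is one way to check whether a date filter actually reduces the bytes scanned once the tables are partitioned; the table path below is a placeholder:

```python
# Illustrative only: estimate bytes scanned without running (or billing for)
# the query. On an unpartitioned table the estimate is the same with or
# without the date filter; after date-partitioning it should shrink.
from google.cloud import bigquery

client = bigquery.Client()  # assumes application-default or service-account credentials
dry_run = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

sql = """
SELECT open_covid_region_code, date
FROM `<project>.<dataset>.<table>`  -- placeholder table path
WHERE date BETWEEN '2021-01-01' AND '2021-01-14'
"""
job = client.query(sql, job_config=dry_run)
print(f"Estimated bytes processed: {job.total_bytes_processed}")
```
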
nmdefries requested a review from jingjtang on February 1, 2021 at 20:45
nmdefries (Contributor, Author)

Switched this to using the archive differ to manage which dates get added to the API. At the moment, the BigQuery tables process (and bill for) all rows in a table, even if filtering is used, so by default the indicator pulls all dates between "now" and the specified start date.

I've asked the Google team to partition the tables by date so that filtering actually reduces the number of rows processed. If/when that gets added, we'll want to move to pulling data from a narrower date range, maybe a couple of weeks wide, depending on lag and backfill.

jingjtang (Contributor) commented Feb 8, 2021

@nmdefries I didn't look at this previously since the data is not updated regularly, but what is the backfill status for Google Symptoms now? Do we want to stick with the 5-day lag or not?

By the way, I compared the output files with the ones generated by the old code:
state_ageusia_smoothed_comparison.xlsx
state_anosmia_smoothed_comparison.xlsx
state_sum_smoothed_comparison.xlsx
They match with each other.

nmdefries (Contributor, Author) commented Feb 10, 2021

I just pulled the data; it is currently 8 days behind. Once the tables move to daily updates (TBD, but presumably in progress; I plan to check in with Google later this week), we'll have a better idea of the lag, but I suspect it will be about a week.

The pipeline here currently pulls data for all dates (start_date through "now"), so for now we don't need to worry about a large lag causing days to be missed. With the switch to daily table updates, I was planning to pull only the last 14 days (plus extra days to calculate smoothed signals) by default, to save on BigQuery costs.

> By the way, I compared the output files with the ones generated by the old code... They match with each other.

That's great, thanks for checking!!

nmdefries (Contributor, Author) commented Feb 10, 2021

The tables do not appear to change over time (no backfill). I checked equality of the county anosmia and ageusia values for Jan 25, pulled today vs. pulled 9 days ago.
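
A minimal sketch of that kind of spot check, assuming both pulls were saved as CSVs; the file and column names here are hypothetical:

```python
# Hypothetical file names; compares the same reference date pulled at two times.
import pandas as pd

old_pull = pd.read_csv("county_anosmia_20210125_pulled_20210201.csv").set_index("geo_id")
new_pull = pd.read_csv("county_anosmia_20210125_pulled_20210210.csv").set_index("geo_id")

# Raises an AssertionError if any value differs, i.e. if backfill occurred.
pd.testing.assert_frame_equal(old_pull.sort_index(), new_pull.sort_index())
```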

jingjtang (Contributor) left a comment

lgtm!

krivard merged commit 392a2be into main on Feb 10, 2021
krivard deleted the gs-pull-from-bigquery branch on February 10, 2021 at 20:44
nmdefries restored the gs-pull-from-bigquery branch on February 22, 2021 at 14:52
nmdefries deleted the gs-pull-from-bigquery branch on February 22, 2021 at 15:18