
Convert Google Symptoms pipeline to pull data from BigQuery #699


Merged
krivard merged 48 commits into main from gs-pull-from-bigquery on Feb 10, 2021

Conversation

nmdefries (Contributor) commented Jan 15, 2021

Description

Switch from pulling Google Symptoms data from GitHub (now deprecated) to pulling directly from the relevant tables in BigQuery.

To reduce BigQuery usage, this pulls only the required columns (open_covid_region_code, date, and the symptom columns), and only those dates between the export start date and the current date that do not already appear in any file names in the receiving directory. An additional 6 days prior to the earliest missing date are pulled to support calculating smoothed indicators.
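
For concreteness, a minimal sketch of that date-selection logic. The helper names, the file-name convention in the receiving directory, and the use of pandas are assumptions for illustration, not the actual code in pull.py:

```python
# Illustrative sketch only; names and file-name conventions are hypothetical.
from datetime import date, timedelta
from pathlib import Path

import pandas as pd

PAD_DAYS = 6  # extra preceding days pulled so smoothed indicators can be computed


def get_missing_dates(receiving_dir: str, export_start_date: date) -> list:
    """Return dates between export_start_date and today with no exported file."""
    candidates = pd.date_range(export_start_date, date.today()).date
    # Assumes export file names begin with the date as YYYYMMDD.
    existing = {fname.name[:8] for fname in Path(receiving_dir).glob("*.csv")}
    return [d for d in candidates if d.strftime("%Y%m%d") not in existing]


def get_dates_to_pull(missing_dates: list) -> list:
    """Prepend the 6 days before the earliest missing date, for smoothing."""
    if not missing_dates:
        return []
    start = min(missing_dates) - timedelta(days=PAD_DAYS)
    return list(pd.date_range(start, max(missing_dates)).date)
```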

Changelog

  • Modifies pull.py::pull_gs_data() to act as a high-level function
  • Creates supporting functions in pull.py for pulling a single geo type at a time, getting the list of dates to retrieve, and formatting the query string (see the sketch after this list)
  • Adds tests for these functions
  • Adds new field to params.json to support BigQuery API credentials
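
For illustration, a sketch of what the query-string formatting might look like; the table path, the symptom column names, and the date representation are assumptions, not copied from the PR:

```python
# Illustrative sketch only: the actual column list and table path differ.
def format_query(table: str, dates: list) -> str:
    """Build a query that pulls only the required columns for the given dates."""
    symptom_cols = ["symptom_Anosmia", "symptom_Ageusia"]  # example subset
    cols = ", ".join(["open_covid_region_code", "date"] + symptom_cols)
    date_list = ", ".join(f"'{d.strftime('%Y-%m-%d')}'" for d in dates)
    return f"SELECT {cols} FROM `{table}` WHERE date IN ({date_list})"

# Hypothetical usage:
# query = format_query("<project>.<dataset>.<state_table>", dates_to_pull)
```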

jingjtang (Contributor) commented Jan 28, 2021

small linting errors

************* Module delphi_google_symptoms.run
delphi_google_symptoms/run.py:13:16: C0303: Trailing whitespace (trailing-whitespace)
delphi_google_symptoms/run.py:14:22: C0303: Trailing whitespace (trailing-whitespace)
delphi_google_symptoms/run.py:74:0: C0303: Trailing whitespace (trailing-whitespace)
delphi_google_symptoms/run.py:11:0: C0411: standard import "import time" should be placed before "import numpy as np" (wrong-import-order)
************* Module delphi_google_symptoms.pull
delphi_google_symptoms/pull.py:5:0: W0611: Unused date imported from datetime (unused-import)

jingjtang (Contributor)

@krivard Do we have the smoothing utils merged? According to #306, it seems not. If they are, we can switch directly to the smoothing utils.

chinandrew (Contributor)

> @krivard Do we have the smoothing utils merged? According to #306, it seems not. If they are, we can switch directly to the smoothing utils.

It's merged and available in delphi_utils.smooth

https://github.com/cmu-delphi/covidcast-indicators/blob/main/_delphi_utils_python/delphi_utils/smooth.py

krivard (Contributor) commented Jan 29, 2021

> small linting errors

@chinandrew any idea why this would be failing linting when run locally, but passing in CI?

krivard (Contributor) commented Jan 29, 2021

> @krivard Do we have the smoothing utils merged?

Not relevant to this PR -- this is just for switching to BigQuery. We'll handle switching to the smoother in a separate effort before reactivating this indicator in production.

(PRs are meant to be lightweight and easy to review; piling everything into a single PR makes it harder to tell if something is wrong)

chinandrew (Contributor)

> small linting errors

> @chinandrew any idea why this would be failing linting when run locally, but passing in CI?

That's interesting... they're failing on CI too, but the check is passing. The exit code is 0 for me, which is odd. Pylint hasn't released a new version recently either, so I'm a bit confused why that's happening.

chinandrew (Contributor) commented Jan 29, 2021

I think I figured it out. make lint runs two lint commands strung together with a semicolon, so if the first fails but the second passes, the exit code corresponds to the last command and is 0. I'll work on a fix.
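
A minimal sketch (not the actual Makefile or CI config) of the exit-code behavior being described: the shell's exit status for semicolon-joined commands is that of the last command only, so a failing first linter is masked by a passing second one.

```python
# `false` stands in for a failing pylint run, `true` for a passing one.
import subprocess

print(subprocess.run("false; true", shell=True).returncode)   # 0 -> CI sees success
print(subprocess.run("false && true", shell=True).returncode)  # 1 -> first failure propagates
```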

The old set of tables, one per year and per country/state/county level, has
been removed. The new set of tables, one per country/state/county level,
will have static names and will be continually updated with new dates.

Function dependence on table year was removed. Logic to handle a
table-not-found error was removed -- this was originally meant to print
a message and continue execution if a year-table had not yet been
created. Since the new tables should always exist, a table-not-found
error should stop execution.

Since the BigQuery tables are currently not partitioned by date, each
query processes (and bills for) all rows in the table regardless of the
filters applied (date and country, at the moment). Given that, this
pipeline pulls all dates from the specified start date to the current
date by default, since a narrower filter would not reduce cost. This
setting should be updated to a shorter date window (~14 days seems
reasonable) if/when the tables are converted to "partitioned" format.
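
As an aside (not part of the PR), a BigQuery dry run is one way to check whether a date filter actually reduces the bytes scanned once the tables are partitioned; the table path below is a placeholder:

```python
# Illustrative only: estimate bytes scanned without running (or billing for)
# the query. On an unpartitioned table the estimate is the same with or
# without the date filter; after date-partitioning it should shrink.
from google.cloud import bigquery

client = bigquery.Client()  # assumes application-default or service-account credentials
dry_run = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

sql = """
SELECT open_covid_region_code, date
FROM `<project>.<dataset>.<table>`  -- placeholder table path
WHERE date BETWEEN '2021-01-01' AND '2021-01-14'
"""
job = client.query(sql, job_config=dry_run)
print(f"Estimated bytes processed: {job.total_bytes_processed}")
```
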
nmdefries requested a review from jingjtang on February 1, 2021 at 20:45
nmdefries (Contributor, Author)

Switched this to using the archive differ to manage which dates get added to the API. At the moment, the BigQuery tables process (and bill for) all rows in a table, even if filtering is used, so by default the indicator pulls all dates between "now" and the specified start date.

I've asked the Google team to partition the tables by date so that filtering actually reduces the number of rows processed. If/when that gets added, we'll want to move to pulling data from a narrower date range, maybe a couple of weeks wide, depending on lag and backfill.

jingjtang (Contributor) commented Feb 8, 2021

@nmdefries I didn't look at this previously since the data is not updated regularly, but what is the backfill status for Google Symptoms now? Do we want to stick with the 5-day lag or not?

By the way, I compared the output files with the ones generated by the old code:
state_ageusia_smoothed_comparison.xlsx
state_anosmia_smoothed_comparison.xlsx
state_sum_smoothed_comparison.xlsx
They match with each other.

nmdefries (Contributor, Author) commented Feb 10, 2021

I just pulled the data; it is currently 8 days behind. Once the tables move to daily updates (TBD, but presumably in progress; I plan to check in with Google later this week), we'll have a better idea of the lag, but I suspect it will be about a week.

The pipeline here currently pulls data for all dates (start_date through "now"), so for now we don't need to worry about a large lag causing days to be missed. With the switch to daily table updates, I was planning to pull only the last 14 days (plus extra days to calculate smoothed signals) by default, to save on BigQuery costs.

> By the way, I compared the output files with the ones generated by the old code... They match with each other.

That's great, thanks for checking!!

nmdefries (Contributor, Author) commented Feb 10, 2021

The tables do not appear to change over time (no backfill). I checked equality of the county anosmia and ageusia values for Jan 25, pulled today vs. pulled 9 days ago.
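
A minimal sketch of that kind of spot check, assuming both pulls were saved as CSVs; the file and column names here are hypothetical:

```python
# Hypothetical file names; compares the same reference date pulled at two times.
import pandas as pd

old_pull = pd.read_csv("county_anosmia_20210125_pulled_20210201.csv").set_index("geo_id")
new_pull = pd.read_csv("county_anosmia_20210125_pulled_20210210.csv").set_index("geo_id")

# Raises an AssertionError if any value differs, i.e. if backfill occurred.
pd.testing.assert_frame_equal(old_pull.sort_index(), new_pull.sort_index())
```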

jingjtang (Contributor) left a comment

lgtm!

krivard merged commit 392a2be into main on Feb 10, 2021
krivard deleted the gs-pull-from-bigquery branch on February 10, 2021 at 20:44
nmdefries restored the gs-pull-from-bigquery branch on February 22, 2021 at 14:52
nmdefries deleted the gs-pull-from-bigquery branch on February 22, 2021 at 15:18