Convert Google Symptoms pipeline to pull data from BigQuery #699
…tables. add docstrings.
small linting errors
@chinandrew any idea why this would be failing linting when run locally, but passing in CI?
Not relevant to this PR -- this is just for switching to BigQuery. We'll handle switching to the smoother in a separate effort before reactivating this indicator in production. (PRs are meant to be lightweight and easy to review; piling everything into a single PR makes it harder to tell if something is wrong.)
That's interesting... they're failing on CI too, but the check is passing. The exit code is 0 for me, which is odd. They haven't released a new version recently either, so I'm a bit confused about why that's happening.
I think I figured it out.
Old set of tables, one per year and per country/state/county level, has been removed. The new set of tables, one per country/state/county level, will have static names and will be continually updated with new dates. Function dependence on table year was removed. Logic to handle a table-not-found error was removed -- this was originally meant to print a message and continue execution if a year-table had not yet been created. Since the new tables should always exist, a table-not-found error should stop execution.
…ings. add num days to fetch param
Since BigQuery tables are currently not partitioned by date, each query processes and bills for all rows in the table regardless of filters applied (date and country, at the moment). To take advantage of this, this pipeline will pull all dates from the specified start date to the current date by default. This setting should be updated to a shorter date window (~14 days seems reasonable) if/when the tables are converted to "partitioned" format.
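As a minimal sketch of the date-window behaviour described above (not the PR's actual code), the helper below returns every date from the start date through today by default, and only the most recent days when a window length is given. The function name `dates_to_fetch` and the parameter name `num_days_to_fetch` are hypothetical, chosen only for illustration.

```python
from datetime import date, timedelta


def dates_to_fetch(export_start_date, today, num_days_to_fetch=None):
    """Return the dates to request, oldest first (hypothetical helper).

    With num_days_to_fetch=None the window runs from export_start_date
    through today, inclusive. Otherwise only the most recent
    num_days_to_fetch days are returned, never reaching back before
    export_start_date.
    """
    if today < export_start_date:
        return []
    start = export_start_date
    if num_days_to_fetch is not None:
        # Narrow the window, but do not go earlier than the start date.
        start = max(start, today - timedelta(days=num_days_to_fetch - 1))
    n_days = (today - start).days + 1
    return [start + timedelta(days=i) for i in range(n_days)]
```

Passing a value such as 14 would correspond to the shorter window suggested for partitioned tables; leaving it unset keeps the pull-everything default.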
Switched this to using the archive differ to manage which dates get added to the API. At the moment, the BigQuery tables process (and bill for) all rows in a table, even if filtering is used, so by default the indicator pulls all dates between "now" and the specified start date. I've asked the Google team to partition the tables by date so that filtering actually reduces the number of rows processed. If/when that gets added, we'll want to move to pulling data from a narrower date range, maybe a couple of weeks wide depending on lag and backfill.
@nmdefries I didn't look at that previously since the data is not updated regularly, but how is the backfill status for google symptoms now? Do we want to stick to the 5-day lag or not? By the way, I compared the output files with the ones generated by the old code.
I just pulled the data; it is currently 8 days behind. Once the tables move to daily updates (TBD but presumably in progress; plan to check in with Google later this week), we'll have a better idea of lag, but I suspect that it will be about a week. The pipeline here currently pulls data from all dates (
That's great, thanks for checking!!
The tables do not appear to change over time (no backfill). I checked equality of county anosmia and ageusia values on Jan 25 pulled today vs pulled 9 days ago.
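A backfill check like the one described could be sketched as follows. This is only an illustration, not the comparison that was actually run: it assumes each pull has been loaded into a dict keyed by (region, date), and the name `changed_keys` is hypothetical.

```python
_MISSING = object()  # sentinel so an absent key differs from a stored None


def changed_keys(earlier, later):
    """Return keys whose value differs between two pulls of the same data.

    A key present in only one pull also counts as changed. Values are
    compared with exact equality, so no floating-point tolerance is applied.
    """
    keys = earlier.keys() | later.keys()
    return sorted(
        k for k in keys
        if earlier.get(k, _MISSING) != later.get(k, _MISSING)
    )
```

An empty result for a given date would be consistent with no backfill having occurred between the two pulls.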
lgtm!
Description
Switch from pulling Google Symptoms data from GitHub (now deprecated) to pulling directly from the relevant tables in BigQuery.
To reduce BigQuery usage, this pulls only the required columns (open_covid_region_code, date, and symptom columns) for dates between the export start date and the current date that do not appear in any file names in the receiving directory. An additional 6 days prior are pulled for calculating smoothed indicators.
Changelog
Itemize code/test/documentation changes and files added/removed.
pull.py::pull_gs_data() to act as a high-level function
pull.py, such as pulling a single geo type at a time, getting the list of dates to retrieve, and formatting the query string
params.json to support BigQuery API credentials
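The date selection and query formatting mentioned in the description and changelog might look roughly like the sketch below. It is not the code in pull.py. All function names, the table name, and the symptom column names are hypothetical placeholders. It also assumes that file names in the receiving directory begin with a YYYYMMDD date followed by an underscore, and that the 6 extra days are added before each missing date.

```python
import re
from datetime import date, timedelta

PAD_DAYS = 6  # extra earlier days pulled so smoothed values can be computed


def dates_in_filenames(filenames):
    """Collect dates from names starting with YYYYMMDD_ (assumed convention)."""
    found = set()
    for name in filenames:
        match = re.match(r"(\d{4})(\d{2})(\d{2})_", name)
        if match:
            found.add(date(*map(int, match.groups())))
    return found


def missing_dates(export_start_date, today, filenames):
    """Dates in [export_start_date, today] with no file in the directory."""
    present = dates_in_filenames(filenames)
    n_days = max((today - export_start_date).days + 1, 0)
    window = (export_start_date + timedelta(days=i) for i in range(n_days))
    return [d for d in window if d not in present]


def dates_with_padding(missing, pad_days=PAD_DAYS):
    """Add the pad_days days preceding each missing date, oldest first."""
    padded = set()
    for d in missing:
        for offset in range(pad_days + 1):
            padded.add(d - timedelta(days=offset))
    return sorted(padded)


def format_query(table, symptom_columns, dates):
    """Build an illustrative query that selects only the required columns."""
    columns = ", ".join(["open_covid_region_code", "date", *symptom_columns])
    date_list = ", ".join(f'"{d.isoformat()}"' for d in dates)
    return f"SELECT {columns} FROM `{table}` WHERE date IN ({date_list})"
```

For example, with placeholder inputs:

```python
files = ["20210101_county_anosmia.csv", "20210102_county_anosmia.csv"]
missing = missing_dates(date(2021, 1, 1), date(2021, 1, 4), files)
query = format_query(
    "my-project.my_dataset.my_table",  # placeholder table name
    ["symptom_Anosmia"],               # placeholder column name
    dates_with_padding(missing),
)
```

Selecting named columns rather than all columns is what reduces the bytes billed on an unpartitioned table; the date filter on its own does not, as noted in the conversation.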