Add instructions to diagnose external tables in GCS deprecation (#2863)
SorenSpicknall authored Aug 4, 2023
1 parent 8ed0d4f commit c421b37
Showing 1 changed file with 1 addition and 0 deletions: runbooks/data/deprecation-stored-files.md
@@ -10,6 +10,7 @@ Occasionally, we want to assess our Google Cloud Storage buckets for outdatedness

3. For the non-test buckets that constitute the deprecation candidate list, the path forward depends on investigating internal project configuration and talking with data stakeholders. Some data may need to be kept in place because it is frequently accessed despite being infrequently updated (NTD data or static website assets, for instance). Some data may need to be cold-stored rather than deleted outright because it is raw data collected once that can't otherwise be recovered, because regulatory requirements mandate retention, or because future research may need a window of access. Take each of the following steps to determine which path applies:
* Search the source code of the [data-infra repository](https://github.com/cal-itp/data-infra), the [data-analyses repository](https://github.com/cal-itp/data-analyses), and the [reports repository](https://github.com/cal-itp/reports) for the name of the bucket, as well as the environment variables [set in Cloud Composer](https://console.cloud.google.com/composer/environments/detail/us-west2/calitp-airflow2-prod/variables?project=cal-itp-data-infra). If you find the bucket referenced anywhere, investigate whether the reference is in active use (a search sketch follows this list). For an extra measure of safety, you can also search the entire Cal-ITP GitHub organization's source code via GitHub's web interface.
* Note: [External tables](https://cloud.google.com/bigquery/docs/external-tables) in BigQuery, created from GCS objects via [our `create_external_tables` DAG](https://o1d2fa0877cf3fb10p-tp.appspot.com/dags/create_external_tables/grid) in Airflow, do not generate read or write traffic that appears in the GCS request count metric used in step one. If you find a reference to a deprecation candidate bucket within the [`create_external_tables` subfolder](https://github.com/cal-itp/data-infra/tree/main/airflow/dags/create_external_tables) of the data-infra repository, check [BigQuery audit logs](https://cloud.google.com/bigquery/docs/reference/auditlogs/#data_access_data_access) to see whether people are querying the external tables that rely on that bucket (and if so, remove it from the deprecation list); a query sketch follows this list.
* Post in `#data-warehouse-devs` and any other relevant channels in Slack (this may vary by domain; for example, if investigating a bucket related to GTFS quality, you may post in `#gtfs-quality`). Ask whether anybody knows of ongoing use of the bucket(s) in question. If there are identifiable stakeholders who aren't active in Slack, like external research partners, reach out to them directly.
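The source search in the first bullet above can be scripted. Below is a minimal sketch, assuming local clones of the three repositories sit in the working directory; the bucket name is a hypothetical placeholder. Cloud Composer environment variables still need to be checked separately via the console link above.

```python
from pathlib import Path

BUCKET = "calitp-example-bucket"  # hypothetical deprecation candidate
REPOS = ["data-infra", "data-analyses", "reports"]  # assumed local clone paths

for repo in REPOS:
    for path in Path(repo).rglob("*"):
        # Skip git internals and anything that isn't a regular file.
        if ".git" in path.parts or not path.is_file():
            continue
        try:
            text = path.read_text(errors="ignore")
        except OSError:
            continue
        if BUCKET in text:
            print(path)
```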
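For the external-table check, one way to see who is querying a table (related to, but distinct from, the audit logs linked above) is BigQuery's `INFORMATION_SCHEMA.JOBS` view, which retains roughly 180 days of job history including each job's referenced tables. A sketch, with hypothetical table names; adjust the `region-us` qualifier to match the dataset's location:

```python
from google.cloud import bigquery

client = bigquery.Client(project="cal-itp-data-infra")

# The table name below is a hypothetical external table built on a
# deprecation-candidate bucket; substitute the real table(s).
sql = """
SELECT user_email, COUNT(*) AS query_count, MAX(creation_time) AS last_query
FROM `region-us`.INFORMATION_SCHEMA.JOBS, UNNEST(referenced_tables) AS t
WHERE t.table_id = 'example_external_table'
  AND creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 90 DAY)
GROUP BY user_email
ORDER BY query_count DESC
"""

for row in client.query(sql).result():
    print(row.user_email, row.query_count, row.last_query)
```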

4. For each bucket that hasn't been removed from the deprecation list via the investigation in the last step, create a new bucket named "[EXISTING BUCKET NAME]-deprecated" and follow [these steps](https://cloud.google.com/storage/docs/moving-buckets#permissions-console) to transfer the original bucket contents into the newly created bucket(s). Delete the original bucket(s), inform stakeholders about the newly deprecated buckets via `#data-warehouse-devs` and other relevant channels, and monitor for two weeks for any new code or process breakages related to the deletion of the old buckets.
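For small buckets, the move can also be scripted with the `google-cloud-storage` Python client rather than the Storage Transfer Service steps linked above. A rough sketch, with hypothetical bucket names; the `us-west2` location is an assumption, matching the Composer environment:

```python
from google.cloud import storage

client = storage.Client(project="cal-itp-data-infra")

src_name = "calitp-example-bucket"  # hypothetical bucket slated for deprecation
src = client.bucket(src_name)
dst = client.create_bucket(f"{src_name}-deprecated", location="us-west2")

# Copy every object into the new bucket before touching the original.
for blob in client.list_blobs(src):
    src.copy_blob(blob, dst, new_name=blob.name)

# force=True deletes remaining objects first, but only works for buckets
# with a small number of objects; large buckets need a different approach.
src.delete(force=True)
```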