statline-bq
is an open source library built to upload datasets from the Dutch CBS (Central Bureau of Statistics) into Google BigQuery, where they can be accessed via SQL. While intended mainly as a helper library within the NL Open Data project, it can also be used as a standalone CLI application. When provided with a list of valid dataset IDs (e.g. "83583NED"), it downloads the datasets and uploads them to Google BigQuery, where they can be queried using SQL.
NOTE - you must have the appropriate permissions on an existing Google Cloud Project prior to running statline-bq.
In order to take advantage of open data, it must be possible to combine various datasets. As of now, doing so requires substantial programming and data engineering knowledge from anyone who wishes to do it. This library, and its parent project NL Open Data, aim to make that task easier.
Using pip:
pip install statline_bq
-> NOT IMPLEMENTED YET
Using Poetry: As a Poetry-managed package, statline-bq can also be installed via Poetry. Assuming Poetry is already installed:
- Clone the repository
- From your local clone's root folder, run poetry install
There are two elements that need to be configured prior to using the CLI. When statline-bq is used as an imported library, the exact usage determines whether both are needed (or just one, or none).
The GCP project id, bucket, and location should be provided by editing statline_bq/config.toml, allowing up to 3 choices at runtime: dev, test and prod. prod is further divided into 3: cbs_dl, serving as a data lake for the BASE CBS catalog; external_dl, serving as a data lake for the THIRD-PARTY catalog; and dwh, meant for any additional or combined resources. The selection within the three prod environments is automatically inferred at runtime, according to the source parameter.
Note that you must nest the GCP project details correctly for them to be interpreted, as seen below. You must also have the proper IAM permissions on the GCP projects (more details below).
Correct nesting in config file:
[gcp]
[gcp.dev]
project_id = "my_dev_project_id"
bucket = "my_dev_bucket"
location = "EU"
[gcp.test]
project_id = "my_test_project_id"
bucket = "my_test_bucket"
location = "EU"
[gcp.prod]
[gcp.prod.cbs_dl]
project_id = "my-cbs-dl"
bucket = "my-cbs-dl_bucket"
location = "EU"
[gcp.prod.external_dl]
project_id = "my-external-dl"
bucket = "my-external-dl_bucket"
location = "EU"
[gcp.prod.dwh]
project_id = "my-open-dwh"
bucket = "my-open-dwh_bucket"
location = "EU"
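For reference, here is a minimal sketch of reading this configuration from Python. It assumes the object returned by get_config exposes the nested gcp sections as attributes mirroring the TOML above; the actual structure is defined in statline_bq/config.py.
from statline_bq.config import get_config
config = get_config("./statline_bq/config.toml")  # string path to config.toml
# Assumed attribute access, mirroring the nesting of the TOML file above
print(config.gcp.dev.project_id)      # "my_dev_project_id"
print(config.gcp.prod.cbs_dl.bucket)  # "my-cbs-dl_bucket"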
Additionally, the local temp paths used by the library can be configured here. If a new source is used, it must be added both in config.toml under [paths] and in config.py to the Paths class.
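As a purely hypothetical illustration of that pattern (the real field names and class definition live in config.py and may differ), adding a new source could look roughly like this on the Python side, with a matching entry added under [paths] in config.toml:
# Hypothetical sketch of the Paths class in config.py; field names are placeholders.
from dataclasses import dataclass

@dataclass
class Paths:
    cbs: str = "cbs"  # existing temp path for the CBS source (placeholder)
    my_new_source: str = "my_new_source"  # newly added source, mirrored under [paths] in config.toml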
Provide a list of all CBS dataset ids that are to be uploaded to GCP, e.g.:
ids = ["83583NED", "83765NED", "84799NED", "84583NED", "84286NED"]
This should be given by directly editing statline_bq/datasets.toml
statline-bq can be used from the command line, or imported as a library.
Once the library is installed and configured, you can either use poetry run:
- From your terminal, navigate to "my_path_to_library/statline-bq/statline_bq/"
- run poetry run statline-bq
- That's it!
Or spawn a shell:
- From your terminal, navigate to "my_path_to_library/statline-bq/statline_bq/"
- run poetry shell
- run statline-bq --help for info
- Running the whole process:
from statline_bq.utils import check_v4, cbsodata_to_gbq
from statline_bq.config import get_config
id = "83583NED" # dataset id from CBS
config = get_config("./statline_bq/config.toml") # string path to config.toml
odata_version = check_v4(id=id) # assigns 'v4' if a v4 version exists, 'v3' otherwise
cbsodata_to_gbq(
id=id,
odata_version=odata_version,
config=config
)
- Only uploading to GCS:
from statline_bq.utils import upload_to_gcs
from statline_bq.config import get_config
id = "83583NED"  # dataset id from CBS
odata_version = "v3"  # odata version: 'v3' or 'v4'
config = get_config("./statline_bq/config.toml")  # string path to config.toml
upload_dir = "some_folder/folder_to_upload_from/"  # path to directory containing files to be uploaded
gcs_folder = upload_to_gcs(
    dir=upload_dir, odata_version=odata_version, id=id, config=config
)
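Building on the "Running the whole process" example above, multiple datasets (for instance the ids list kept in statline_bq/datasets.toml) can be processed in a loop. This is a sketch using only the functions already shown; the ids below are placeholders:
from statline_bq.utils import check_v4, cbsodata_to_gbq
from statline_bq.config import get_config
config = get_config("./statline_bq/config.toml")  # string path to config.toml
ids = ["83583NED", "83765NED", "84799NED"]  # e.g. the ids list from statline_bq/datasets.toml
for id in ids:
    odata_version = check_v4(id=id)  # 'v4' if a v4 version exists, 'v3' otherwise
    cbsodata_to_gbq(id=id, odata_version=odata_version, config=config)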