We will work on creating a small example of a Data Lake in GCS.
Run the following commands to create the two Compute Engine instances that will run the orders and delivery applications:
gcloud compute instances create orders-app \
--zone=europe-west1-b \
--scopes=https://www.googleapis.com/auth/cloud-platform \
--subnet=projects/<your-project-id>/regions/europe-west1/subnetworks/default \
--machine-type=e2-micro \
--source-machine-image=projects/<your-project-id>/global/machineImages/<MACHINE_IMAGE_NAME> \
--boot-disk-size=10GB
gcloud compute instances create delivery-app \
--zone=europe-west1-b \
--scopes=https://www.googleapis.com/auth/cloud-platform \
--subnet=projects/<your-project-id>/regions/europe-west1/subnetworks/default \
--machine-type=e2-micro \
--source-machine-image=projects/<your-project-id>/global/machineImages/<MACHINE_IMAGE_NAME> \
--boot-disk-size=10GB
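If you want to double-check that both instances came up correctly, a quick listing should show them in RUNNING state (a minimal check, assuming the instance names used above):

```bash
# List the two app instances and confirm they are RUNNING
gcloud compute instances list \
  --filter="name:(orders-app delivery-app)" \
  --format="table(name, zone, status)"
```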
To create a Cloud SQL instance, follow these steps:
Run the following command to create the instance:
gcloud sql instances create edem-postgres \
--database-version=POSTGRES_15 \
--tier=db-f1-micro \
--region=europe-west1 \
--availability-type=zonal \
--storage-size=100 \
--no-deletion-protection \
--authorized-networks=0.0.0.0/0 \
--root-password=EDEM2425
Run the following command to create a user for the database:
gcloud sql users create postgres \
--instance=edem-postgres \
--password=EDEM2425
Run the following command to create the ecommerce database:
gcloud sql databases create ecommerce \
--instance=edem-postgres
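Optionally, you can verify that the instance is reachable and that the ecommerce database exists by opening a psql session through gcloud (this assumes you have the psql client installed locally; gcloud temporarily allowlists your IP and prompts for the password set above):

```bash
# Open a psql session against the ecommerce database as the postgres user
gcloud sql connect edem-postgres --user=postgres --database=ecommerce
```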
Run the following commands to create the BQ datasets:
bq --location=europe-west1 mk --dataset orders_bronze
bq --location=europe-west1 mk --dataset delivery_bronze
Run the following commands to create the BQ tables:
bq mk --table --schema="id:INT64,customer_name:STRING,email:STRING" \
orders_bronze.customers
bq mk --table --schema="id:INT64,product_name:STRING,price:FLOAT64" \
orders_bronze.products
bq mk --table --schema="id:INT64,customer_id:INT64,created_at:TIMESTAMP,total_price:FLOAT64" \
orders_bronze.orders
bq mk --table --schema="order_id:INT64,product_id:INT64,quantity:INT64,price:FLOAT64" \
orders_bronze.order_products
bq mk --table \
--schema="subscription_name:STRING,message_id:STRING,publish_time:TIMESTAMP,data:JSON,attributes:JSON" \
--time_partitioning_field=publish_time \
--clustering_fields=subscription_name,message_id \
--label=source:bq_subs \
delivery_bronze.raw_events_delivery
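Before moving on, you can confirm that the datasets and table schemas look as expected:

```bash
# List the tables created in each bronze dataset
bq ls orders_bronze
bq ls delivery_bronze

# Inspect the schema of the raw delivery events table
bq show --schema --format=prettyjson delivery_bronze.raw_events_delivery
```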
Run the following commands to create the Pub/Sub topics and subscriptions:
# Create Pub/Sub Topics
gcloud pubsub topics create delivery-events
gcloud pubsub topics create order-events
gcloud pubsub topics create delivery-events-dead-letter
# Create the order-events subscription
gcloud pubsub subscriptions create order-events-sub \
--topic=order-events
# Create Pub/Sub Subscriptions with BigQuery Sink
gcloud pubsub subscriptions create delivery-events-bq-sub \
--topic=delivery-events \
--bigquery-table=<your-project-id>:delivery_bronze.raw_events_delivery \
--write-metadata \
--dead-letter-topic=delivery-events-dead-letter \
--max-delivery-attempts=5
# Create Dead Letter Subscription
gcloud pubsub subscriptions create delivery-events-dead-letter-sub \
--topic=delivery-events-dead-letter
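Optionally, verify the Pub/Sub setup before continuing; the describe output for the BigQuery subscription should show both the BigQuery sink and the dead-letter policy:

```bash
# The three topics and three subscriptions created above should appear here
gcloud pubsub topics list --format="value(name)"
gcloud pubsub subscriptions list --format="value(name)"

# Check that the bigqueryConfig and deadLetterPolicy blocks are present
gcloud pubsub subscriptions describe delivery-events-bq-sub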
Move to the terraform directory of the exercise:
cd EDEM2425/gcp_datalake/excercise_end2end/terraform
Modify the service_account_email variable in the variables.tf file with the email of your project's service account.
terraform init
terraform plan
terraform apply
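If you prefer not to edit variables.tf, the same value can be passed on the command line when applying (this assumes the variable is named service_account_email, as in the template):

```bash
terraform apply -var="service_account_email=<your-service-account-email>"
```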
Once we have finished the exercise, we can destroy the infrastructure by running:
terraform destroy
### In both instances
- Log in to the instance:
gcloud compute ssh <instance-name> --zone=europe-west1-b
- Run a git pull inside the repository:
cd EDEM2425
git pull
- Move to the correct directory:
cd gcp_datawarehouse/excercise_end2end
- Create a virtual environment:
python3 -m venv .venv
- Activate the venv:
source .venv/bin/activate
- Install the requirements:
pip install -r requirements.txt
- Run the orders-app using nohup so that it runs in the background (if you close the terminal, the app will keep running):
nohup bash -c 'HOST_IP=<your-cloud-sql-ip> GCS_BUCKET_NAME=<your-bucket-name> PROJECT_ID=<your-project-id> python -m orders_app.orders_to_db.main' > output.log 2>&1 &
This will start creating orders, storing them in the database, and publishing confirmation events to the order-events topic.
If you want to see the logs, run the following command:
tail -f output.log
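After a couple of minutes you can also check that the app has written data to the bucket (the exact object path depends on how the app names its files, so adjust the prefix if needed):

```bash
# List the objects the orders-app has written to the bucket
gcloud storage ls --recursive gs://<your-bucket-name>/
```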
- Run the following command to start the delivery-app:
nohup bash -c 'PROJECT_ID=<your-project-id> python -m delivery_app.main' > output.log 2>&1 &
If you want to see the logs, run the following command:
tail -f output.log
This will start consuming the events from the order-events topic and publishing delivery events to the delivery-events topic.
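Once the delivery-app has been running for a bit, the BigQuery subscription should start landing rows in the raw table; a quick count confirms the end-to-end flow (assuming your default project is configured for bq):

```bash
# Count the delivery events written by the Pub/Sub BigQuery subscription
bq query --use_legacy_sql=false \
  'SELECT COUNT(*) AS events FROM delivery_bronze.raw_events_delivery'
```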
Now that we have the parquet file created in our bucket, we will create an external table in BigQuery to access its content.
The steps you have to follow are:
- Go to the orders_bronze dataset.
- Click on Create table.
- In Create table from, select Google Cloud Storage.
- Click on Browse and choose your file from the correct bucket.
- In the Table field, write raw_additional_products_info.
- In Table type, select External table.
- In the Schema section, select the Auto detect option.
- Click on Create table.
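If you prefer the CLI, a sketch of the equivalent bq command is shown below; replace the GCS URI with the actual path of your parquet file (for Parquet the schema is detected from the file itself):

```bash
# Create an external table in orders_bronze pointing at the parquet file in GCS
bq mk \
  --external_table_definition=@PARQUET=gs://<your-bucket-name>/<path-to-parquet-file> \
  orders_bronze.raw_additional_products_info
```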
Now we can deploy the EL pipeline to synchronize the Postgres database with BigQuery.
To do so, we need to run the following commands:
- Create your application-default credentials, in case you haven't done it yet. This will allow us to authenticate to the GCP services with our user account:
gcloud auth application-default login
- Create the venv, activate it and install the requirements:
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
- Run the script to synchronize the Postgres database:
POSTGRES_IP=<postgres-ip> GCP_PROJECT=<your-gcp-project-id> python -m analytical_layer.el_orders.main
If you are using Windows CMD, you can run:
set POSTGRES_IP=<postgres-ip>
set GCP_PROJECT=<your-gcp-project-id>
python -m analytical_layer.el_orders.main
If you are using Windows Powershell, you can run:
$env:POSTGRES_IP = "<postgres-ip>"; $env:GCP_PROJECT = "<your-gcp-project-id>"; python -m analytical_layer.el_orders.main
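When the script finishes, you can check that the Postgres tables arrived in BigQuery. For example, the following assumes the EL job loads into the orders_bronze dataset created earlier; adjust the table name if your script targets a different one:

```bash
# Row count of the synchronized orders table in BigQuery
bq query --use_legacy_sql=false \
  'SELECT COUNT(*) AS orders FROM orders_bronze.orders'
```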
Now that we have the EL for the orders_app, let's use dbt to create a view for the delivery events.
- Create a directory with the name of the project in the dbt directory:
mkdir dbt_project_example
- Move to the dbt directory:
cd dbt_project_example
- Install dbt-bigquery:
pip install dbt-bigquery
- Initialize the dbt project:
dbt init edem_project
This will create the structure of the dbt project.
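Before copying the template, it can be worth confirming that dbt can read the project and reach BigQuery with the profile you configured during dbt init:

```bash
# Verify the project configuration and the BigQuery connection
dbt debug --project-dir edem_project
```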
Now, you have a template prepared with the modifications needed to create the views for the delivery events.
- Go to the dbt_template folder in the excercise_end2end directory.
- Copy the dbt_project.yml content to the dbt_project_example/edem_project/dbt_project.yml file.
- Copy the content of the models folder to the dbt_project_example/edem_project/models folder.
- Copy the content of the macros folder to the dbt_project_example/edem_project/macros folder.
- Run the command to create the views:
cd dbt_project_example/edem_project
dbt run --select expanded_delivery_events
This will create the view for the delivery events in the delivery dataset in BigQuery.
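You can sanity-check the result by querying the new view directly (this assumes the view keeps the model name expanded_delivery_events and lands in the delivery dataset, as described above):

```bash
bq query --use_legacy_sql=false \
  'SELECT * FROM delivery.expanded_delivery_events LIMIT 5'
```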
There are also some models in the analytics folder that build aggregation tables.
Run the following command to create the tables:
dbt run --select analytics
Let's now deploy Metabase to visualize the data from BigQuery.
- Go to the exercise directory in your local machine:
cd EDEM2425/gcp_datawarehouse/excercise_end2end
- Deploy the docker-compose of the analytical-layer:
cd analytical-layer
docker-compose up -d
- Go to http://localhost:3000 in your browser to access Metabase.
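If the page does not load, check that the container is up and inspect its logs:

```bash
# Show the containers started by the compose file and follow their logs
docker-compose ps
docker-compose logs -f
```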