ROI GCP Training Demos

Goal

The goal of this demo is to show relative performance of different schema choices in BigQuery.

After setup, the same data will be stored in tables with varying schemas:

  • Normalized w/multiple tables
  • Fully denormalized
  • Single-level nesting (one row per order)
  • Single-level nesting with partitioning
  • Single-level nesting with partitioning and clustering

You can demo the changes in query performance and cost with these different schemas.

Incidentally, this setup also makes it possible to demo Dataflow, BigQuery load techniques, intra-BigQuery ETL, and so on.
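
As a concrete illustration of the last variant, a minimal sketch of a single-level nested table with partitioning and clustering might look like the following (the column names are assumptions for illustration; the real schemas come from the setup scripts):

CREATE TABLE bq_demo.table_nested_partitioned_clustered (
  order_id INT64,
  customer_id INT64,
  order_date DATE,
  -- one row per order; line items stored as a repeated STRUCT
  line_items ARRAY<STRUCT<product_id INT64, quantity INT64, unit_price NUMERIC>>
)
-- partition on the order date, then cluster within each partition by customer
PARTITION BY order_date
CLUSTER BY customer_id;

Partitioning lets queries that filter on order_date prune whole partitions, and clustering on customer_id further reduces the data scanned for customer-scoped queries.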

Setup

Notes

There are two ways to do setup.

  • Start with a pre-created normalized data set
  • Start by creating your own normalized data set

The former approach uses this normalized data set:

roi-bq-demo.bq_demo

which has the following tables:

  • customer: 75M rows
  • product: 10K rows
  • order: 7.5B rows
  • line_item: 75B rows

This is the largest dataset we can reasonably store for instructor use. If you want a larger dataset, you can follow the instructions for creating your own normalized tables.
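
If you use the pre-created dataset, a quick sanity check of those row counts (assuming the dataset is readable from your project) is the dataset's __TABLES__ metadata view, which reads table metadata rather than the table data itself:

SELECT table_id, row_count, size_bytes
FROM `roi-bq-demo.bq_demo.__TABLES__`
ORDER BY row_count DESC;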

Setup - general

  1. Log into the Cloud Console and select an appropriate project. The queries used to derive the non-normalized tables and to run the demos are expensive, so choose your project wisely (i.e. do not use a project billable to ROI Training even if you can).

  2. In your target project, make sure that there is a dataset named bq_demo (create it if necessary).
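
A SQL sketch for creating the dataset (the US multi-region location here is an assumption - use whatever location matches where your data and bucket will live):

CREATE SCHEMA IF NOT EXISTS bq_demo
OPTIONS (location = 'US');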

Setup - standard normalized tables

  1. Run the query in load_data.sql. This will take about 70 minutes, cost about $150, and create four new tables in your target project/dataset (a sketch of the kind of transformation involved follows this list):
    • denorm
    • nested_once
    • table_nested_partitioned
    • table_nested_partitioned_clustered
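
load_data.sql is intra-BigQuery ETL: the new tables are derived from the normalized ones. Purely as an illustration of the pattern (this is not the contents of load_data.sql, and the column names are assumptions), a single-level nested table can be built by aggregating line items into a repeated STRUCT per order:

CREATE OR REPLACE TABLE bq_demo.nested_once AS
SELECT
  o.order_id,
  o.customer_id,
  o.order_date,
  -- collapse child rows into a repeated STRUCT, leaving one row per order
  ARRAY_AGG(STRUCT(li.product_id, li.quantity, li.unit_price)) AS line_items
FROM `bq_demo.order` AS o
JOIN `bq_demo.line_item` AS li
  ON li.order_id = o.order_id
GROUP BY o.order_id, o.customer_id, o.order_date;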

Setup - create your own normalized tables

  1. Open Cloud Shell

  2. In your project of choice, create a bucket to hold the data files

gsutil mb -c regional -l us-central1 gs://$DEVSHELL_PROJECT_ID

  3. Create and activate a virtual environment

virtualenv ~/venvs/bq-demo
source ~/venvs/bq-demo/bin/activate

  4. Install the required Python packages

pip install -r requirements.txt

  5. Generate the data files for the normalized BigQuery tables. The amount of time this takes depends on your quotas in the project where it runs - with 100 16-CPU workers, it takes about 70 minutes total and costs $xx. You will definitely need to run the Dataflow job in an environment where you have plenty of quota. Make sure to replace the placeholders in the command line below:

python generate_data.py \
    --project=<project-id> \
    --bucket=<bucket-name> \
    --save_main_session \
    --runner=DataflowRunner \
    --worker_machine_type=<machine-type> \
    --region=<region> \
    --customers=<# customers> \
    --products=10000 \
    --orders=<#orders>

  6. Load the data from the generated files in GCS into BigQuery. This will take on the order of 75 minutes to complete and cost about $150.

. ./load_data.sh

When prompted, enter the project name where the data will reside, and the bucket where Dataflow stored the import files.
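
If you would rather demo the load step itself in SQL, a minimal sketch of an equivalent load (not the contents of load_data.sh; the format, path, and table name are assumptions) is:

LOAD DATA INTO bq_demo.customer
FROM FILES (
  format = 'CSV',
  uris = ['gs://<bucket-name>/customer/*.csv']
);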

Demo

Load the BigQuery UI in the project where you have the dataset. The queries are written with the project name in the table references, so if your data is in a different project you'll need to adjust the project name on every appropriate line in the query.
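
For reference, a fully qualified table reference has the form shown in this minimal check query (substitute your own project ID; nested_once is one of the tables created during setup):

SELECT COUNT(*) AS order_count
FROM `<project-id>.bq_demo.nested_once`;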

  1. Run the query found in norm_query.sql (illustrative sketches of the query shapes used in this demo appear after this list)

    • Note the amount of data processed
    • Note the structure of the query
    • Note the time taken to complete the query
    • Example run: 2TB, 110 seconds, $10
  2. Run the query found in denorm_query.sql

    • Note the amount of data processed
    • Note the structure of the query
    • Note the time taken to complete the query
    • Note that you get the same results as in step 1
    • Example run: 2.2TB, 24 seconds, $11
  3. Run the first query found in nested_queries.sql

    • Note the amount of data processed
    • Note the structure of the query
    • Note the time taken to complete the query
    • Note that you get the same results as in step 1
    • Example run: 1.2TB, 6.4 seconds, $7
  4. Run the second query found in nested_queries.sql

    • Note the amount of data processed
    • Note the structure of the query
    • Note the time taken to complete the query
    • Note that you get the same results as in step 1
    • Example run: 0.1TB, 6.0 seconds, $0.50
  5. Run queries 3, 4, and 5 from nested_queries.sql

    • Note the amount of data processed for each
    • Note the query time for each
    • You should see decreases for each one
    • Example runs
      • 1.2TB, 6.3 seconds, $7
      • 614GB, 3.8 seconds, $3
      • 11GB, 3.2 seconds, $0.06
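
As mentioned in step 1, the query shapes being contrasted look roughly like the sketches below. These are illustrations only, not the contents of norm_query.sql or nested_queries.sql; the column names, aggregate, and date range are assumptions.

-- Normalized: the answer requires joins across customer, order, and line_item
SELECT c.customer_id, SUM(li.quantity * li.unit_price) AS total_spend
FROM `<project-id>.bq_demo.customer` AS c
JOIN `<project-id>.bq_demo.order` AS o
  ON o.customer_id = c.customer_id
JOIN `<project-id>.bq_demo.line_item` AS li
  ON li.order_id = o.order_id
GROUP BY c.customer_id;

-- Nested: the same answer from a single table, with no join
SELECT n.customer_id, SUM(li.quantity * li.unit_price) AS total_spend
FROM `<project-id>.bq_demo.nested_once` AS n, UNNEST(n.line_items) AS li
GROUP BY n.customer_id;

-- Partitioned/clustered: a filter on the partitioning column lets BigQuery
-- prune partitions, so far less data is scanned
SELECT n.customer_id, SUM(li.quantity * li.unit_price) AS total_spend
FROM `<project-id>.bq_demo.table_nested_partitioned_clustered` AS n,
     UNNEST(n.line_items) AS li
WHERE n.order_date BETWEEN DATE '2020-01-01' AND DATE '2020-03-31'
GROUP BY n.customer_id;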