dataproc-templates/notebooks/postgresql2bq at release-please--branches--main · GoogleCloudPlatform/dataproc-templates

Name	Name	Last commit message	Last commit date
parent directory ..
PostgreSqlToBigQuery_parameterize_script.py	PostgreSqlToBigQuery_parameterize_script.py
README.md	README.md
__init__.py	__init__.py
postgresql-to-bigquery-notebook.ipynb	postgresql-to-bigquery-notebook.ipynb

Jupyter Notebook Solution for migrating PostgreSQL to BigQuery DWH using Dataproc Templates

This notebook solution utilizes dataproc templates for migrating databases from PostgreSQL to BIGQUERY.

The notebook contains step by step process for a downtime based migration.

Refer Setup Vertex AI - PySpark to setup new Jupyter notebook in vertexAI. Once the setup is done navigate to /notebooks/postgresql2bq folder and open postgresql-to-bigquery-notebook.

Overview

This notebook is built on top of:

Vertex AI Jupyter Notebook
Google Cloud's Dataproc Serverless
Dataproc Templates which are maintained in this github project.

Key Benefits

Automatically generates list of tables from metadata. Alternatively, user can supply list of tables.
Identifies current primary key column name, and partitioned read properties.
Automatically uses partition reads if exceeds threshold.
Divides migration into batches and parallely migrates multiple tables.
Notebook allows you to choose modes i.e. appending data or overwrite.
Bigquery load automatically creates table if the table does not exists.

Requirements

Below configurations are required before proceeding further.

Common Parameters

PROJECT : GCP project-id
REGION : GCP region
GCS_STAGING_LOCATION : Cloud Storage staging location to be used for this notebook to store artifacts
SUBNET : VPC subnet
JARS : list of jars. For this notebook postgresql and Spark Bigquery connector jars are required in addition to the dataproc template jars
MAX_PARALLELISM : Parameter for number of jobs to run in parallel default value is 5

POSTGRESQL Parameters

POSTGRESQL_HOST : POSTGRESQL instance ip address
POSTGRESQL_PORT : POSTGRESQL instance port
POSTGRESQL_USERNAME : POSTGRESQL username
POSTGRESQL_PASSWORD : POSTGRESQL password
POSTGRESQL_DATABASE : Name of database that you want to migrate
POSTGRESQL_TABLE_LIST : List of tables you want to migrate eg ['schema.table1','schema.table2'] else provide an empty list for migration of specific schemas or the whole database. Example: [].
POSTGRESQL_SCHEMA_LIST : List of schemas. Use this if you'ld like to migrate all tables associated with specific schemas eg. ['schema1','schema2']. Otherwise, leave this parameter empty. Example: []. This comes in handy when don't want to provide names of all the tables separately but would rather prefer migrating all tables from a schema.

BigQuery Parameters

BIGQUERY_DATASET : BigQuery Target Dataset
BIGQUERY_MODE : Mode of operation at target append/overwrite

Run programmatically with parameterize script

Alternatively to running the notebook manually, we developed a "parameterize" script, using the papermill lib, to allow running notebooks programmatically from a Python script, with parameters.

Example submission:

export GCP_PROJECT=<project>
export REGION=<region>
export GCS_STAGING_LOCATION=<bucket-name>
export SUBNET=<subnet>

python run_notebook.py --script=POSTGRESQLTOBIGQUERY \
                        --postgresql.host="10.x.x.x" \
                        --postgresql.port="3306" \
                        --postgresql.username="user" \
                        --postgresql.password="password" \
                        --postgresql.database="db" \
                        --postgresql.table.list="employee" \
                        --postgresql.schema.list="" \

                        --bigquery.dataset="templates" \
                        --bigquery.mode.="overwrite"

Parameters:

python run_notebook.py --script=POSTGRESQLTOBIGQUERY --help

usage: run_notebook.py [-h]
        [--output.notebook OUTPUT.NOTEBOOK]
        --postgresql.host POSTGRESQL_HOST
        [--postgresql.port POSTGRESQL_PORT]
        --postgresql.username POSTGRESQL_USERNAME
        --postgresql.password POSTGRESQL_PASSWORD
        --postgresql.database POSTGRESQL_DATABASE
        [--postgresql.table.list POSTGRESQL_TABLE_LIST]
        [--postgresql.schema.list POSTGRESQL_SCHEMA_LIST]
        --bigquery.dataset BIGQUERY_DATASET
        --bigquery.mode BIGQUERY_MODE

optional arguments:
    -h, --help            
        show this help message and exit
    --output.notebook OUTPUT.NOTEBOOK
        Path to save executed notebook (Default: None). If not provided, no notebook is saved
    --postgresql.host POSTGRESQL_HOST
        POSTGRESQL host or IP address
    --postgresql.port POSTGRESQL_PORT
        POSTGRESQL port (Default: 3306)
    --postgresql.username POSTGRESQL_USERNAME
        POSTGRESQL username
    --postgresql.password POSTGRESQL_PASSWORD
        POSTGRESQL password
    --postgresql.database POSTGRESQL_DATABASE
        POSTGRESQL database name
    --postgresql.table.list POSTGRESQL_TABLE_LIST
        POSTGRESQL table list to migrate. Leave empty for migrating complete database else provide tables as "table1,table2"
    --postgresql.table.list POSTGRESQL_SCHEMA_LIST
        POSTGRESQL schema list to migrate. Only Migrate tables associated with the provided schema list
    --bigquery.dataset BIGQUERY_DATASET
        BIGQUERY dataset name
    --bigquery.dataset BIGQUERY_MODE
        BIGQUERY output write mode (Default: overwrite)

Required JAR files

This notebook requires the POSTGRESSQL connector jars. Installation information is present in the notebook

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

postgresql2bq

postgresql2bq

README.md

Jupyter Notebook Solution for migrating PostgreSQL to BigQuery DWH using Dataproc Templates

Overview

Key Benefits

Requirements

Common Parameters

POSTGRESQL Parameters

BigQuery Parameters

Run programmatically with parameterize script

Required JAR files

Files

postgresql2bq

Directory actions

More options

Directory actions

More options

Latest commit

History

postgresql2bq

Folders and files

parent directory

README.md

Jupyter Notebook Solution for migrating PostgreSQL to BigQuery DWH using Dataproc Templates

Overview

Key Benefits

Requirements

Common Parameters

POSTGRESQL Parameters

BigQuery Parameters

Run programmatically with parameterize script

Required JAR files