Demo 1 - ETL and Visualizations with Open Data Hub

This demo consists of Jupyter notebooks, an Elyra pipeline, and a Superset Dashboard that together provide an example of how to use the tools available within Open Data Hub on an Operate First cluster to perform ETL and create interactive dashboards and visualizations of our data.

(Figure: Demo 1 Flowchart)

The flowchart above demonstrates the workflow followed for this demo. Before jumping into recreating it, let's first take a look at the prerequisites needed for the initial setup.

Demo 1

Initial Setup

  • In order to access the environment for developing this project, you will have to be added to the odh-env-users group here. This can be done by opening an issue on this page with the title "Add to odh-env-users".

  • Once added to the users list, you will be able to access JupyterHub, Kubeflow Pipelines, Trino, and the Superset Dashboard.

  • After logging into JupyterHub, select the AICoE OS-Climate Demo image to get started with the project.

(Figure: Spawn JupyterHub)

  • Make sure you add the credentials file to the root folder of the project repository. For more details on how to set up your credentials and retrieve your JWT token to access Trino, refer to the documentation given here (a minimal connection sketch follows this list).

  • To install the dependencies needed for running the notebooks, you can run a pipenv install at the root of the repository or use the Horus magic commands from within the notebook.
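As a minimal sketch (not the demo's exact code) of what these credentials enable, assuming the trino and python-dotenv packages are installed, that the credentials file is named credentials.env, and that it defines TRINO_HOST, TRINO_PORT, TRINO_USER, and TRINO_PASSWD, a notebook can connect to Trino roughly like this:

```python
# A minimal sketch, not the demo's exact code: load the Trino credentials from
# a local environment file and open a connection with the trino client.
import os

import trino
from dotenv import load_dotenv  # assumes python-dotenv is installed

# "credentials.env" at the repository root is an assumed file name.
load_dotenv("credentials.env")

conn = trino.dbapi.connect(
    host=os.environ["TRINO_HOST"],
    port=int(os.environ["TRINO_PORT"]),
    user=os.environ["TRINO_USER"],
    http_scheme="https",
    # TRINO_PASSWD is assumed to hold the JWT token mentioned above.
    auth=trino.auth.JWTAuthentication(os.environ["TRINO_PASSWD"]),
)

cur = conn.cursor()
cur.execute("SHOW CATALOGS")
print(cur.fetchall())
```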

Data Collection and Processing

  1. In the demo1-create-tables Jupyter notebook, we start by collecting raw data and ingesting it into Trino.

  2. In the demo1-join-tables Jupyter notebook, we run a join against a different federated data table and ingest the result into Trino (a rough sketch of both steps follows below).
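The sketch below is illustrative only; the catalog, schema, table, and column names are placeholders rather than the ones used in the notebooks, and conn is the Trino connection from the setup sketch above:

```python
# Illustrative only: placeholder catalog/schema/table names, reusing the
# Trino connection `conn` from the setup sketch.
cur = conn.cursor()

# demo1-create-tables: ingest raw data into a Trino table.
cur.execute("""
    CREATE TABLE IF NOT EXISTS my_catalog.demo.raw_metrics AS
    SELECT * FROM my_catalog.demo.staging_metrics
""")

# demo1-join-tables: join against a federated table and store the result.
cur.execute("""
    CREATE TABLE IF NOT EXISTS my_catalog.demo.joined_metrics AS
    SELECT r.*, c.sector
    FROM my_catalog.demo.raw_metrics r
    JOIN other_catalog.reference.company_info c
      ON r.company_id = c.company_id
""")
```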

ML Pipeline

To run the two notebooks in a sequential and automated fashion, we use the Elyra notebook pipelines editor together with Kubeflow Pipelines.

You can access the saved Elyra Pipeline here.

Set up Kubeflow Pipelines

  • To run the pipeline, you will need the Elyra notebook pipeline editor. Make sure you are on the Elyra Notebook Image or the AICoE OS-Climate Demo image on the OS-Climate JupyterHub.

  • To get a Kubeflow pipeline running, you need to create a runtime image and a Kubeflow Pipelines runtime configuration.

Add runtime images

To create a runtime image using Elyra, follow the steps given here.

(Figure: Add runtime images)

Fill all required fields to create a runtime image for the project repo:

  • Name: demo1-aicoe-osc
  • Image Name: quay.io/os-climate/aicoe-osc-demo:v.10

Add Kubeflow Pipeline runtime configuration

To create a Kubeflow Pipelines runtime configuration using Elyra, follow the steps given here.

(Figure: Add runtime configuration)

Insert all inputs for the runtime configuration:

  • Name: demo1-aicoe-osc
  • Kubeflow Pipeline API Endpoint: http://ml-pipeline-ui-kubeflow.apps.odh-cl1.apps.os-climate.org/pipeline
  • Kubeflow Pipeline Engine: Tekton
  • Cloud Object Storage Endpoint: S3_ENDPOINT
  • Cloud Object Storage Username: AWS_ACCESS_KEY_ID
  • Cloud Object Storage Password: AWS_SECRET_ACCESS_KEY
  • Cloud Object Storage Bucket Name: S3_BUCKET
  • Cloud Object Storage Credentials Secret: S3_SECRET

Set up Notebook Properties

There is a pre-existing Elyra pipeline, demo1.pipeline, that has been set up for this demo.

(Figure: Elyra Pipeline)

To trigger this pipeline, you need to make sure that the node properties for each notebook have been updated.

(Figure: Set up Notebook Properties)

You need to fill in the cloud object storage bucket credentials (S3_ENDPOINT, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, S3_BUCKET) and the Trino database access credentials (TRINO_USER, TRINO_PASSWD, TRINO_HOST, TRINO_PORT). Note that if you are using the Cloud Object Storage Credentials Secret field in the Kubeflow Pipelines runtime configuration, AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY can be omitted from the notebook properties, as Kubeflow automatically reads the encrypted credentials from the secret defined in OpenShift.
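As a small sketch of how a notebook can pick these values up at run time (the variable names match the node properties above; boto3 is assumed to be available in the runtime image):

```python
# Sketch: read the node properties / injected secret values from the
# environment and build an S3 client for the bucket used by the pipeline.
import os

import boto3  # assumed to be available in the runtime image

s3 = boto3.client(
    "s3",
    endpoint_url=os.environ["S3_ENDPOINT"],
    # If the Cloud Object Storage Credentials Secret is set in the runtime
    # configuration, Kubeflow injects these two from the OpenShift secret.
    aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
    aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
)
bucket = os.environ["S3_BUCKET"]
print(s3.list_objects_v2(Bucket=bucket).get("KeyCount", 0), "objects in", bucket)

# The Trino credentials are read the same way (see the connection sketch in
# the Initial Setup section).
trino_host = os.environ["TRINO_HOST"]
trino_port = int(os.environ["TRINO_PORT"])
```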

Run Pipeline

Once your pipeline is set up, you can run it by hitting the run button at the top left of the pipeline editor. Give the run a name and select the previously created Kubeflow Pipelines runtime configuration from the dropdown.

(Figure: Run Pipeline)
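The demo is driven from the Elyra editor as described above; purely as an optional aside, the Kubeflow Pipelines SDK can also inspect runs on the same API endpoint that the runtime configuration points at. A sketch, assuming the kfp package is installed and the endpoint is reachable without additional authentication from where you run it:

```python
# Optional aside, not part of the demo steps: list recent runs on the same
# Kubeflow Pipelines endpoint used in the Elyra runtime configuration.
import kfp

client = kfp.Client(
    host="http://ml-pipeline-ui-kubeflow.apps.odh-cl1.apps.os-climate.org/pipeline"
)
response = client.list_runs(page_size=5)
for run in response.runs or []:
    print(run.name, run.status)
```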

Visualization

After collecting the data and ingesting it into Trino, we want to visualize various metrics that provide insights into the fetched data. We use Apache Superset dashboards to visualize these metrics.

(Figure: Superset Dashboard)

To learn how to use Trino and Superset to add more visualizations, you can follow the guide here.
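As a rough illustration only (reusing the placeholder joined_metrics table from the earlier sketch, not an actual demo table), a Superset chart is typically backed by an aggregate query along these lines, which you would normally define as a dataset or SQL Lab query inside Superset:

```python
# Hypothetical aggregation over the placeholder joined_metrics table; a
# Superset chart would be backed by an equivalent SQL dataset.
cur = conn.cursor()
cur.execute("""
    SELECT sector, COUNT(*) AS company_count
    FROM my_catalog.demo.joined_metrics
    GROUP BY sector
    ORDER BY company_count DESC
""")
for sector, company_count in cur.fetchall():
    print(sector, company_count)
```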