This demo consists of Jupyter notebooks, an Elyra pipeline, and a Superset dashboard, and provides an example of how to use the tools available within Open Data Hub on an Operate First cluster to perform ETL and create interactive dashboards and visualizations of our data.
Above is a flowchart that illustrates the workflow followed for this demo. Before jumping into the workflow for recreating this demo, let's first take a look at some prerequisites needed for the initial setup.
- In order to access the environment for the development of the project, you will have to be added to the odh-env-users group here. This can be done by opening an issue on this page with the title "Add to odh-env-users".
- Once added to the users list, you will be able to access JupyterHub, Kubeflow Pipelines, Trino, and the Superset dashboard.
- After logging into JupyterHub, select the `AICoE OS-Climate Demo` image to get started with the project.
- Make sure you add the credentials file to the root folder of the project repository, as sketched below. For more details on how to set up your credentials and retrieve your JWT token to access Trino, refer to the documentation given here.
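  For reference, here is a minimal sketch of what such a credentials file might contain and how a notebook could load it. The filename `credentials.env` and the use of python-dotenv are assumptions, not the demo's prescribed mechanism; the variable names match those used in the pipeline configuration later in this document.

  ```python
  # Hypothetical credentials.env at the repository root (filename assumed;
  # the variable names match those referenced later in this demo):
  #
  #   TRINO_USER=<your-username>
  #   TRINO_PASSWD=<your-JWT-token>
  #   TRINO_HOST=<trino-hostname>
  #   TRINO_PORT=<trino-port>
  #   S3_ENDPOINT=<object-storage-endpoint>
  #   AWS_ACCESS_KEY_ID=<access-key>
  #   AWS_SECRET_ACCESS_KEY=<secret-key>
  #   S3_BUCKET=<bucket-name>

  from dotenv import load_dotenv  # assumes python-dotenv is installed

  load_dotenv("credentials.env")  # loads the variables into the notebook's environment
  ```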
- To install the dependencies needed for running the notebooks, you can run `pipenv install` at the root of the repository or use the Horus magic commands from within the notebook.
- In the `demo1-create-tables` Jupyter notebook, we start by collecting raw data and ingesting it into Trino; see the connection sketch below.
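  The tables actually created are defined in the notebook itself; the following is only a minimal sketch of the connection pattern, using the `trino` Python client with JWT authentication. The catalog, schema, and column names are placeholders, not the demo's real schema.

  ```python
  import os

  import trino

  # Connect to Trino; the JWT token retrieved earlier serves as the password.
  conn = trino.dbapi.connect(
      host=os.environ["TRINO_HOST"],
      port=int(os.environ["TRINO_PORT"]),
      user=os.environ["TRINO_USER"],
      http_scheme="https",
      auth=trino.auth.JWTAuthentication(os.environ["TRINO_PASSWD"]),
  )
  cur = conn.cursor()

  # Placeholder table: catalog, schema, and columns are illustrative only.
  cur.execute(
      """
      CREATE TABLE IF NOT EXISTS demo_catalog.demo_schema.raw_data (
          company VARCHAR,
          metric VARCHAR,
          value DOUBLE
      )
      """
  )
  ```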
- In the `demo1-join-tables` Jupyter notebook, we run a join against a different federated data table and ingest the result into Trino; see the join sketch below.
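  Schematically, such a join can be expressed as a single CREATE TABLE AS statement spanning two catalogs. This reuses the cursor from the sketch above, and again all catalog, table, and column names are placeholders rather than the demo's actual schema.

  ```python
  # Hypothetical federated join: both catalogs and all columns are placeholders.
  cur.execute(
      """
      CREATE TABLE IF NOT EXISTS demo_catalog.demo_schema.joined_data AS
      SELECT a.company, a.metric, a.value, b.sector
      FROM demo_catalog.demo_schema.raw_data AS a
      JOIN federated_catalog.reference.sectors AS b
        ON a.company = b.company
      """
  )
  ```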
To run the two notebooks in a sequential and automated fashion, we use the Elyra notebook pipelines editor together with Kubeflow Pipelines. You can access the saved Elyra pipeline here.
- To run the pipeline, you will need the Elyra notebook pipeline editor. Make sure you are on the `Elyra Notebook Image` or the `AICoE OS-Climate Demo` image on the OS-Climate JupyterHub.
- To get a Kubeflow pipeline running, you need to create a runtime image and a Kubeflow Pipelines runtime configuration.
To create a runtime image using Elyra, follow the steps given here. Fill in all required fields to create a runtime image for the project repo:
- Name: `demo1-aicoe-osc`
- Image Name: `quay.io/os-climate/aicoe-osc-demo:v.10`
To create a Kubeflow Pipelines runtime configuration using Elyra, follow the steps given here. Insert all inputs for the runtime configuration:
- Name: `demo1-aicoe-osc`
- Kubeflow Pipeline API Endpoint: `http://ml-pipeline-ui-kubeflow.apps.odh-cl1.apps.os-climate.org/pipeline`
- Kubeflow Pipeline Engine: `Tekton`
- Cloud Object Storage Endpoint: `S3_ENDPOINT`
- Cloud Object Storage Username: `AWS_ACCESS_KEY_ID`
- Cloud Object Storage Password: `AWS_SECRET_ACCESS_KEY`
- Cloud Object Storage Bucket Name: `S3_BUCKET`
- Cloud Object Storage Credentials Secret: `S3_SECRET`
There is a pre-existing Elyra pipeline that has been set up for this demo, called `demo1.pipeline`. To trigger this pipeline, you need to make sure that the node properties for each notebook have been updated. You need to fill in the cloud object storage bucket credentials like `S3_ENDPOINT`, `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `S3_BUCKET`, and the Trino database access credentials like `TRINO_USER`, `TRINO_PASSWD`, `TRINO_HOST`, `TRINO_PORT`. Please note that if you are using the Cloud Object Storage Credentials Secret field in the Kubeflow Pipelines runtime configuration, `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` can be omitted from the notebook properties, as Kubeflow can automatically read the encrypted credentials from the secret defined in OpenShift. A quick sanity-check sketch follows.
Once your pipeline is set up, you can run it by hitting the Run button at the top left of the pipeline editor. Give the run a name and select the previously created Kubeflow Pipelines runtime configuration from the dropdown.
After collecting the data and ingesting it into Trino, we want to visualize various metrics that provide insights into the fetched data. We use Apache Superset dashboards to visualize these metrics.
To learn how to use Trino and Superset to add more visualizations, you can follow the guide here.