Demo 1 - ETL and Visualizations with Open Data Hub

This demo consists of Jupyter notebooks, an Elyra pipeline, and a Superset Dashboard that together provide an example of how to use the tools available within Open Data Hub on an Operate First cluster to perform ETL and create interactive dashboards and visualizations of our data.

(Figure: Demo 1 Flowchart)

The flowchart above demonstrates the workflow followed for this demo. Before jumping into recreating it, let's first take a look at the prerequisites needed for the initial setup.

Demo 1

Initial Setup

  • In order to access the environment for developing this project, you will have to be added to the odh-env-users group here. This can be done by opening an issue on this page with the title "Add to odh-env-users".

  • Once added to the users list, you will be able to access JupyterHub, Kubeflow Pipelines, Trino, and the Superset Dashboard.

  • After logging into JupyterHub, select the AICoE OS-Climate Demo image to get started with the project.

(Figure: Spawn JupyterHub)

  • Make sure you add the credentials file to the root folder of the project repository. For more details on how to set up your credentials and retrieve your JWT token to access Trino, refer to the documentation given here (a minimal connection sketch follows this list).

  • To install the dependencies needed for running the notebooks, you can run a pipenv install at the root of the repository or use the Horus magic commands from within the notebook.
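As a minimal sketch (not the demo's exact code) of what these credentials enable, assuming the trino and python-dotenv packages are installed, that the credentials file is named credentials.env, and that it defines TRINO_HOST, TRINO_PORT, TRINO_USER, and TRINO_PASSWD, a notebook can connect to Trino roughly like this:

```python
# A minimal sketch, not the demo's exact code: load the Trino credentials from
# a local environment file and open a connection with the trino client.
import os

import trino
from dotenv import load_dotenv  # assumes python-dotenv is installed

# "credentials.env" at the repository root is an assumed file name.
load_dotenv("credentials.env")

conn = trino.dbapi.connect(
    host=os.environ["TRINO_HOST"],
    port=int(os.environ["TRINO_PORT"]),
    user=os.environ["TRINO_USER"],
    http_scheme="https",
    # TRINO_PASSWD is assumed to hold the JWT token mentioned above.
    auth=trino.auth.JWTAuthentication(os.environ["TRINO_PASSWD"]),
)

cur = conn.cursor()
cur.execute("SHOW CATALOGS")
print(cur.fetchall())
```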

Data Collection and Processing

  1. In the demo1-create-tables Jupyter notebook, we start by collecting raw data and ingesting it into Trino.

  2. In the demo1-join-tables Jupyter notebook, we run a join against a different federated data table and ingest the result into Trino (a rough sketch of both steps follows below).
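The sketch below is illustrative only; the catalog, schema, table, and column names are placeholders rather than the ones used in the notebooks, and conn is the Trino connection from the setup sketch above:

```python
# Illustrative only: placeholder catalog/schema/table names, reusing the
# Trino connection `conn` from the setup sketch.
cur = conn.cursor()

# demo1-create-tables: ingest raw data into a Trino table.
cur.execute("""
    CREATE TABLE IF NOT EXISTS my_catalog.demo.raw_metrics AS
    SELECT * FROM my_catalog.demo.staging_metrics
""")

# demo1-join-tables: join against a federated table and store the result.
cur.execute("""
    CREATE TABLE IF NOT EXISTS my_catalog.demo.joined_metrics AS
    SELECT r.*, c.sector
    FROM my_catalog.demo.raw_metrics r
    JOIN other_catalog.reference.company_info c
      ON r.company_id = c.company_id
""")
```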

ML Pipeline

To run the two notebooks in a sequential and automated fashion, we use the Elyra notebook pipelines editor together with Kubeflow Pipelines.

You can access the saved Elyra Pipeline here.

Set up Kubeflow Pipelines

  • To run the pipeline, you will need the Elyra notebook pipeline editor. Make sure you are on the Elyra Notebook Image or the AICoE OS-Climate Demo image on the OS-Climate JupyterHub.

  • To get a Kubeflow pipeline running, you need to create a runtime image and a Kubeflow Pipelines runtime configuration.

Add runtime images

To create a runtime image using Elyra, follow the steps given here.

(Figure: Add runtime images)

Fill all required fields to create a runtime image for the project repo:

  • Name: demo1-aicoe-osc
  • Image Name: quay.io/os-climate/aicoe-osc-demo:v.10

Add Kubeflow Pipeline runtime configuration

To create a Kubeflow Pipelines runtime configuration using Elyra, follow the steps given here.

(Figure: Add runtime configuration)

Insert all inputs for the runtime configuration:

  • Name: demo1-aicoe-osc
  • Kubeflow Pipeline API Endpoint: http://ml-pipeline-ui-kubeflow.apps.odh-cl1.apps.os-climate.org/pipeline
  • Kubeflow Pipeline Engine: Tekton
  • Cloud Object Storage Endpoint: S3_ENDPOINT
  • Cloud Object Storage Username: AWS_ACCESS_KEY_ID
  • Cloud Object Storage Password: AWS_SECRET_ACCESS_KEY
  • Cloud Object Storage Bucket Name: S3_BUCKET
  • Cloud Object Storage Credentials Secret: S3_SECRET

Set up Notebook Properties

There is a pre-existing Elyra pipeline, demo1.pipeline, that has been set up for this demo.

(Figure: Elyra Pipeline)

To trigger this pipeline, you need to make sure that the node properties for each notebook have been updated.

(Figure: Set up Notebook Properties)

You need to fill in the cloud object storage bucket credentials (S3_ENDPOINT, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, S3_BUCKET) and the Trino database access credentials (TRINO_USER, TRINO_PASSWD, TRINO_HOST, TRINO_PORT). Note that if you are using the Cloud Object Storage Credentials Secret field in the Kubeflow Pipelines runtime configuration, AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY can be omitted from the notebook properties, as Kubeflow automatically reads the encrypted credentials from the secret defined in OpenShift.
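As a small sketch of how a notebook can pick these values up at run time (the variable names match the node properties above; boto3 is assumed to be available in the runtime image):

```python
# Sketch: read the node properties / injected secret values from the
# environment and build an S3 client for the bucket used by the pipeline.
import os

import boto3  # assumed to be available in the runtime image

s3 = boto3.client(
    "s3",
    endpoint_url=os.environ["S3_ENDPOINT"],
    # If the Cloud Object Storage Credentials Secret is set in the runtime
    # configuration, Kubeflow injects these two from the OpenShift secret.
    aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
    aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
)
bucket = os.environ["S3_BUCKET"]
print(s3.list_objects_v2(Bucket=bucket).get("KeyCount", 0), "objects in", bucket)

# The Trino credentials are read the same way (see the connection sketch in
# the Initial Setup section).
trino_host = os.environ["TRINO_HOST"]
trino_port = int(os.environ["TRINO_PORT"])
```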

Run Pipeline

Once your pipeline is set up, you can run it by hitting the run button at the top left of the pipeline editor. Give the run a name and select the previously created Kubeflow Pipelines runtime configuration from the dropdown.

(Figure: Run Pipeline)
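The demo is driven from the Elyra editor as described above; purely as an optional aside, the Kubeflow Pipelines SDK can also inspect runs on the same API endpoint that the runtime configuration points at. A sketch, assuming the kfp package is installed and the endpoint is reachable without additional authentication from where you run it:

```python
# Optional aside, not part of the demo steps: list recent runs on the same
# Kubeflow Pipelines endpoint used in the Elyra runtime configuration.
import kfp

client = kfp.Client(
    host="http://ml-pipeline-ui-kubeflow.apps.odh-cl1.apps.os-climate.org/pipeline"
)
response = client.list_runs(page_size=5)
for run in response.runs or []:
    print(run.name, run.status)
```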

Visualization

After collecting the data and ingesting it into Trino, we want to visualize various metrics that provide insights into the fetched data. We use Apache Superset dashboards to visualize these metrics.

(Figure: Superset Dashboard)

To learn how to use Trino and Superset to add more visualizations, you can follow the guide here.
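As a rough illustration only (reusing the placeholder joined_metrics table from the earlier sketch, not an actual demo table), a Superset chart is typically backed by an aggregate query along these lines, which you would normally define as a dataset or SQL Lab query inside Superset:

```python
# Hypothetical aggregation over the placeholder joined_metrics table; a
# Superset chart would be backed by an equivalent SQL dataset.
cur = conn.cursor()
cur.execute("""
    SELECT sector, COUNT(*) AS company_count
    FROM my_catalog.demo.joined_metrics
    GROUP BY sector
    ORDER BY company_count DESC
""")
for sector, company_count in cur.fetchall():
    print(sector, company_count)
```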