updating project contribution docs (#262)
oindrillac committed May 12, 2021
1 parent 03eafff commit 12b1b04
Showing 5 changed files with 130 additions and 105 deletions.
5 changes: 5 additions & 0 deletions .env-example
@@ -0,0 +1,5 @@
S3_ENDPOINT=<insert-endpoint>
S3_ACCESS_KEY=<insert-access-key>
S3_SECRET_KEY=<insert-secret-key>
S3_BUCKET=<insert-bucket-name>
IN_AUTOMATION=True
111 changes: 6 additions & 105 deletions README.md
@@ -1,118 +1,19 @@

# AI Supported Continuous Integration

_Developing AI tools for developers by leveraging the data made openly available by OpenShift and Kubernetes CI platforms._

AIOps is a critical component of supporting any Open Hybrid Cloud infrastructure. As the systems we operate become larger and more complex, intelligent monitoring tools and response agents will become a necessity. In an effort to accelerate the development, access and reliability of these intelligent operations, we aim to provide access to an open community with open operations data and an open infrastructure for data scientists and DevOps
engineers to collaberate.
AIOps is a critical component of supporting any Open Hybrid Cloud infrastructure. As the systems we operate become larger and more complex, intelligent monitoring tools and response agents will become a necessity. In an effort to accelerate the development, access and reliability of these intelligent operations, we aim to provide access to an open community with open operations data and an open infrastructure for data scientists and DevOps engineers to collaborate.

One major component of the software development and operations workflow is Continuous Integration (CI), which involves running automated builds and tests of software before it is merged into a production code base. For example, if you are developing a container orchestration platform like Kubernetes or OpenShift, these are huge code bases with large builds and many tests that will produce a lot of data that can be difficult to parse if you are trying to figure out why a build is failing or why a certain set of tests aren’t passing.

OpenShift, Kubernetes and a few other platforms have made their CI data public. This is real world multimodal production operations data, a rarity for public data sets today. This presents a great starting point and a first initial area of investigation for the AIOps community to tackle. Our aim is to cultivate open source AIOps projects by developing, integrating and operating AI tools for CI by leveraging the open data that has been made available by OpenShift, Kubernetes and others.

## Try it out!

There are interactive and reproducible notebooks for this entire [project](https://github.com/aicoe-aiops/ocp-ci-analysis) available for anyone to start using on a public [JupyterHub](https://jupyterhub-opf-jupyterhub.apps.zero.massopen.cloud/hub/login) instance on the [Massachusetts Open Cloud](https://massopen.cloud/) (MOC) right now! After signing in, please select the `ocp-ci-analysis:latest` image from the dropdown on the spawner page to explore this project.

<br>

## Current Work:

### Data Engineering: Metrics and KPIs for CI

Before we attempt to apply any AI or machine learning techniques to improve the CI workflow, it is important that we know how to both quantify and evaluate the current state of the CI workflow. In order to do this we must establish and collect the relevant metrics and key performance indicators (KPIs) needed to measure it. This is a critical first step as it allows us to quantify the state of CI operations, as well as apply the KPIs we will need to evaluate the impact of our AI tools in the future.

There are currently five open datasets that can be used to help us fully describe the CI process: Testgrid, Prow, Github, Telemetry and Bugzilla. This data is currently stored in disperate locations and does not exist in a data science friendly format ready for analysis. Below are our current efforts to collect and prepare each of these datasets for further analysis.

<br>

#### TestGrid:

According to the project's [readme](https://github.com/GoogleCloudPlatform/testgrid), TestGrid is a, "highly configurable, interactive dashboard for viewing your test results in a grid!" In other words, it’s an aggregation and visualization platform for CI data. Testgrid primarily reports categorical metrics about which tests passed or failed during specific builds over a configurable time window.

* [List of metrics and KPI's from TestGrid](notebooks/data-sources/TestGrid/metrics/README.md)
* [TestGrid data access and pre-processing](notebooks/data-sources/TestGrid/testgrid_EDA.ipynb)
* [Collect raw data notebook](notebooks/data-sources/TestGrid/metrics/get_raw_data.ipynb)
* Visualization notebook
* [Automated metric pipeline](http://istio-ingressgateway-istio-system.apps.zero.massopen.cloud/pipeline/)
* [Explainer video](https://www.youtube.com/watch?v=lY75bDv6kd4)

<br>

#### Prow/GCS Artifacts:

TestGrid provides the results of tests, but if we want to triage an issue and see the actual logs generated during the building and testing process, we will need to access the logs generated by [prow](https://prow.ci.openshift.org/) and stored in google cloud storage. This dataset contains all of the log data generated for each build and each job as directories and text files in remote storage.

* Notebook (forthcoming)

<br>

#### Github:

The builds and tests run by the CI process are required because of changes that are happening in the applications code base. The goal of CI is to automatically identify if any of these code changes will cause problems for the deployed application. Therefore, we also include information such as metadata and diff’s about the PR’s associated with the builds run by Prow. This dataset contains a mix of numerical, categorical and textual data types.

* Notebook (forthcoming)

<br>

#### Bugzilla:

[Bugzilla](https://bugzilla.redhat.com/) is Red Hat’s bug-tracking system and is used to submit and review defects that have been found in Red Hat distributions. In addition to TestGrid, analyzing bugs related to OpenShift CI can help us get into automated root cause analysis. This is primarily a dataset of human written text.

* Notebook (forthcoming)

<br>

#### Telemetry:

The builds and tests we are interested in analyzing run in the cloud and produce metrics about the resources they are using and report any alerts they experience while running. These metrics are stored in Thanos for users to query. Here we have primarily numerical time series data that describes the underlying state of the cluster running these builds and tests.

* [Telemerty EDA notebook](notebooks/data-sources/Telemetry/telemetry_EDA.ipynb)

<br>

### How to Contribute to the data engineering:

We encourage contributions to this work of developing additional KPI's by opening an [issue](https://github.com/aicoe-aiops/ocp-ci-analysis/issues) with the prefix `KPI Request:` to request a KPI you would like to have included or by writing a new notebook to fulfill one of the existing open `KPI Request:` issues. We have written a [KPI template notebook](notebooks/data-sources/TestGrid/metrics/metric_template.ipynb) to make contributing new metrics as simple and as uniform as possible.

</br>

## Machine Learning and Analytics Projects

With the data sources made easily accessible and with the necessary metrics and KPIs available to quantify and evaluate the CI workflow we can start to apply some AI and machine learning techniques to help improve the CI workflow. There are many ways in which this could be done given the multimodal, multi-source nature of our data. Instead of defining a single specific problem to solve, our current aim is to use this repository as a hub for multiple machine learning and analytics projects centered around this data for AIOps problems focused on improving CI workflows. Below is a list of the current ML and analytics projects.

</br>

### TestGrid Failure Type Classification

Currently, human subject matter experts are able to identify different types of failures by looking at the testgrids. This is, however, a manual process. This project aims to automate the manual identification process for individual Testgrids. This can be thought of as a classification problem aimed at classifying errors on the testgrids as either flakey tests, infra flakes, install flakes or new test failures.

* [Detailed project description](notebooks/failure-type-classification/README.md)
* [Failure Type Classification Notebook](notebooks/failure-type-classification/stage/failure_type_classifier.ipynb)

</br>

### Prow Log Templating For Downstream ML Tasks

Logs represent a rich source of information for automated triaging and root cause analysis. Unfortunately, logs are very noisy data types, i.e, two logs that are of the same type but from two different sources may be different enough at a character level that traditional comparison methods are insufficient to capture this similarity. To overcome this issue, we will use the Prow logs made available to us by this project to identify useful methods for learning log templates that denoise log data and help improve performance on downstream ML tasks.


* Notebook (forthcoming)

</br>

### More Projects Coming Soon…

* [List of potential ML projects](https://github.com/aicoe-aiops/ocp-ci-analysis/issues?q=is%3Aissue+is%3Aopen+%22ML+Request%22+)

</br>

### How to contribute to Machine Learning
* **[Get Started](docs/get-started.md)**

We encourage you to contribute to this work developing additional machine learning analyses by opening an [issue](https://github.com/aicoe-aiops/ocp-ci-analysis/issues) with the prefix `ML Request:` to request a machine learning application you would like to have included or by writing a new notebook to fulfill one of the existing open `ML Request:` issues.
* **[How to Contribute](docs/how-to-contribute.md)**

</br>
* **[Project Content](docs/content.md)**

### @Contact
## Contact

This project is maintained as part of the [Operate First](https://www.operate-first.cloud/) and AIOps teams in Red Hat’s AI CoE as part of the Office of the CTO. More information can be found at https://www.operate-first.cloud/
This project is maintained as part of the [Operate First](https://www.operate-first.cloud/) and AIOps teams in Red Hat’s AI CoE as part of the Office of the CTO. More information can be found at https://www.operate-first.cloud/.
83 changes: 83 additions & 0 deletions docs/content.md
@@ -0,0 +1,83 @@
# Table of Contents
- [Research on current industry offerings](#research-on-current-industry-offerings)
- [Data Engineering: Metrics and KPIs for CI](#data-engineering--metrics-and-kpis-for-ci)
* [TestGrid](#testgrid-)
* [Prow/GCS Artifacts](#prow-gcs-artifacts-)
* [Github](#github-)
* [Bugzilla](#bugzilla-)
* [Telemetry](#telemetry-)
- [Machine Learning and Analytics Projects](#machine-learning-and-analytics-projects)
* [TestGrid Failure Type Classification](#testgrid-failure-type-classification)
* [Prow Log Templating For Downstream ML Tasks](#prow-log-templating-for-downstream-ml-tasks)
* [More Projects Coming Soon…](#more-projects-coming-soon-)
- [Automate Notebook Pipelines using Elyra and Kubeflow](#automate-notebook-pipelines-using-elyra-and-kubeflow)

# Research on current industry offerings

Find a curated list of companies involved in AI/ML for CI/CD [here](docs/aiml-cicd-market-research.md).

# Data Engineering: Metrics and KPIs for CI

Before we attempt to apply any AI or machine learning techniques to improve the CI workflow, it is important that we know how to both quantify and evaluate the current state of the CI workflow. In order to do this we must establish and collect the relevant metrics and key performance indicators (KPIs) needed to measure it. This is a critical first step as it allows us to quantify the state of CI operations, as well as apply the KPIs we will need to evaluate the impact of our AI tools in the future.

There are currently five open datasets that together help us describe the CI process: TestGrid, Prow, GitHub, Telemetry and Bugzilla. This data is currently stored in disparate locations and does not exist in a data-science-friendly format ready for analysis. Below are our current efforts to collect and prepare each of these datasets for further analysis.

## TestGrid:

According to the project's [readme](https://github.com/GoogleCloudPlatform/testgrid), TestGrid is a "highly configurable, interactive dashboard for viewing your test results in a grid!" In other words, it’s an aggregation and visualization platform for CI data. TestGrid primarily reports categorical metrics about which tests passed or failed during specific builds over a configurable time window.

* [List of metrics and KPI's from TestGrid](notebooks/data-sources/TestGrid/metrics/README.md)
* [TestGrid data access and pre-processing](notebooks/data-sources/TestGrid/testgrid_EDA.ipynb)
* [Collect raw data notebook](notebooks/data-sources/TestGrid/metrics/get_raw_data.ipynb)
* Visualization notebook
* [Automated metric pipeline](http://istio-ingressgateway-istio-system.apps.zero.massopen.cloud/pipeline/)
* [Explainer video](https://www.youtube.com/watch?v=lY75bDv6kd4)
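
As a rough illustration of the kind of categorical metric described above, the sketch below computes a pass percentage per test from a small hand-written grid. The status strings, the `grid` layout and the `pass_rate` helper are assumptions made for this example; they are not TestGrid's actual data format or the project's KPI code.

```python
from typing import Dict, List

# Hypothetical grid: test name -> status per build (newest first).
# Real TestGrid data uses its own encoding; these strings are illustrative only.
grid: Dict[str, List[str]] = {
    "install": ["PASS", "PASS", "FAIL", "PASS"],
    "upgrade": ["FAIL", "FAIL", "PASS", "PASS"],
    "e2e-smoke": ["PASS", "NO_RESULT", "PASS", "PASS"],
}


def pass_rate(statuses: List[str]) -> float:
    """Return the percentage of executed runs that passed.

    Runs without a result are skipped so they do not count as failures.
    """
    executed = [s for s in statuses if s in ("PASS", "FAIL")]
    if not executed:
        return 0.0
    return 100.0 * executed.count("PASS") / len(executed)


# One KPI value per test, computed over the whole (assumed) time window.
rates = {test: pass_rate(statuses) for test, statuses in grid.items()}
for test, rate in rates.items():
    print(f"{test}: {rate:.1f}% passing")
```

Skipping runs without a result is a design choice of this sketch; a real KPI might instead report them as a separate category.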

## Prow/GCS Artifacts:

TestGrid provides the results of tests, but if we want to triage an issue and see the actual logs generated during the building and testing process, we need to access the logs generated by [Prow](https://prow.ci.openshift.org/) and stored in Google Cloud Storage. This dataset contains all of the log data generated for each build and each job as directories and text files in remote storage.

* [Prow Archive Discovery](../notebooks/data-sources/gcsweb-ci/prow_archive_discovery.ipynb)

## Github:

The builds and tests run by the CI process are required because of changes happening in the application's code base. The goal of CI is to automatically identify whether any of these code changes will cause problems for the deployed application. Therefore, we also include information such as metadata and diffs for the PRs associated with the builds run by Prow. This dataset contains a mix of numerical, categorical and textual data types.

* Notebook (forthcoming)

## Bugzilla:

[Bugzilla](https://bugzilla.redhat.com/) is Red Hat’s bug-tracking system and is used to submit and review defects that have been found in Red Hat distributions. In addition to TestGrid, analyzing bugs related to OpenShift CI can help us move toward automated root cause analysis. This is primarily a dataset of human-written text.

* [Bugzilla EDA notebook](notebooks/data-sources/Bugzilla/bugzilla_EDA.ipynb)

## Telemetry:

The builds and tests we are interested in analyzing run in the cloud and produce metrics about the resources they are using and report any alerts they experience while running. These metrics are stored in Thanos for users to query. Here we have primarily numerical time series data that describes the underlying state of the cluster running these builds and tests.

* [Telemetry EDA notebook](notebooks/data-sources/Telemetry/telemetry_EDA.ipynb)

# Machine Learning and Analytics Projects

With the data sources made easily accessible and the necessary metrics and KPIs available to quantify and evaluate the CI workflow, we can start to apply AI and machine learning techniques to help improve it. There are many ways in which this could be done given the multimodal, multi-source nature of our data. Instead of defining a single specific problem to solve, our current aim is to use this repository as a hub for multiple machine learning and analytics projects centered around this data for AIOps problems focused on improving CI workflows. Below is a list of the current ML and analytics projects.

## TestGrid Failure Type Classification

Currently, human subject matter experts are able to identify different types of failures by looking at the TestGrid dashboards. This is, however, a manual process. This project aims to automate that identification for individual TestGrid dashboards. It can be thought of as a classification problem: labeling errors on a grid as flaky tests, infra flakes, install flakes or new test failures.

* [Detailed project description](notebooks/failure-type-classification/README.md)
* [Failure Type Classification Notebook](notebooks/failure-type-classification/stage/failure_type_classifier.ipynb)

## Prow Log Templating For Downstream ML Tasks

Logs represent a rich source of information for automated triaging and root cause analysis. Unfortunately, logs are a very noisy data type: two logs of the same type but from two different sources may differ enough at the character level that traditional comparison methods fail to capture their similarity. To overcome this issue, we will use the Prow logs made available to us by this project to identify useful methods for learning log templates that denoise log data and help improve performance on downstream ML tasks.

* Notebook (forthcoming)
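
To make the idea of a log template concrete, here is a minimal sketch that masks variable tokens (IP addresses, hex values, integers) with placeholders so that lines differing only in those values collapse into one template. This is a simple rule-based stand-in, not the method the project will use; the masking patterns and the sample log lines are invented for illustration.

```python
import re
from collections import Counter

# Illustrative masking rules, applied in order. These patterns are assumptions
# for this sketch, not rules taken from the project.
MASKS = [
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<IP>"),
    (re.compile(r"\b0x[0-9a-fA-F]+\b"), "<HEX>"),
    (re.compile(r"\b\d+\b"), "<NUM>"),
]


def to_template(line: str) -> str:
    """Replace variable-looking tokens with placeholders."""
    for pattern, token in MASKS:
        line = pattern.sub(token, line)
    return line


# Made-up log lines; real Prow logs are far longer and more varied.
logs = [
    "pod 17 failed to pull image from 10.0.0.5 after 3 retries",
    "pod 42 failed to pull image from 10.0.1.9 after 5 retries",
    "node 7 reported memory address 0x7ffe12 as invalid",
]

# Count how many raw lines map to each template.
templates = Counter(to_template(line) for line in logs)
for template, count in templates.items():
    print(count, template)
```

The first two lines differ at the character level but share one template, which is the kind of similarity that plain string comparison misses.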

## More Projects Coming Soon…

* [List of potential ML projects](https://github.com/aicoe-aiops/ocp-ci-analysis/issues?q=is%3Aissue+is%3Aopen+%22ML+Request%22+).

# Automate Notebook Pipelines using Elyra and Kubeflow

To automate the sequential running of the project's notebooks for data collection, metric calculation and ML analysis, we are using Kubeflow Pipelines. For more information on using Elyra and Kubeflow Pipelines to automate the notebook workflows, follow the [guide](automating-using-elyra.md).
16 changes: 16 additions & 0 deletions docs/get-started.md
@@ -0,0 +1,16 @@
# Get Started

The aim of the AI for continuous integration project is to build an open AIOps community involved in developing, integrating and operating AI tools for CI by leveraging the open data that has been made available by OpenShift, Kubernetes and others. Check out the [project overview](../README.md) for a detailed overview of this project.

## Try it out yourself

There are interactive and reproducible notebooks for this entire [project](https://github.com/aicoe-aiops/ocp-ci-analysis) available for anyone to start using on the public [JupyterHub](https://jupyterhub-opf-jupyterhub.apps.zero.massopen.cloud/hub/login) instance on the [Massachusetts Open Cloud](https://massopen.cloud/) (MOC) right now!

1. To get started, access [JupyterHub](https://jupyterhub-opf-jupyterhub.apps.zero.massopen.cloud/), select log in with `moc-sso` and sign in using your Google Account.
2. After signing in, on the spawner page, select the `ocp-ci-analysis:latest` image from the dropdown in the JupyterHub Notebook Image section, choose a `Medium` container size, and hit `Start` to start your server.
3. Once your server has spawned, you should see a directory titled `ocp-ci-analysis-<current-timestamp>`. Browse through, run the various notebooks and start exploring this project.
4. To interact with the S3 bucket and access the stored datasets, make sure you have a `.env` file at the root of your repo. Check [.env-example](../.env-example) for an example `.env` file and open an [issue](https://github.com/aicoe-aiops/ocp-ci-analysis/issues) for access credentials.
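
As a minimal sketch of step 4, the snippet below parses `.env`-style text and reports which of the S3 settings from `.env-example` are still unset. The `parse_env` and `unset_keys` helpers, the sample endpoint and the bucket name are hypothetical and written only for this example; the project's notebooks may load these values differently.

```python
from typing import Dict, List

# Keys taken from .env-example; IN_AUTOMATION is not needed for S3 access.
REQUIRED_KEYS = ["S3_ENDPOINT", "S3_ACCESS_KEY", "S3_SECRET_KEY", "S3_BUCKET"]


def parse_env(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, ignoring blank lines and # comments."""
    env: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        env[key.strip()] = value.strip()
    return env


def unset_keys(env: Dict[str, str]) -> List[str]:
    """Return required keys that are missing or still hold an <insert-...> placeholder."""
    return [
        key
        for key in REQUIRED_KEYS
        if not env.get(key) or env[key].startswith("<insert-")
    ]


# Mirrors .env-example, with a made-up endpoint and bucket filled in.
sample = """\
S3_ENDPOINT=https://s3.example.com
S3_ACCESS_KEY=<insert-access-key>
S3_SECRET_KEY=<insert-secret-key>
S3_BUCKET=my-bucket
IN_AUTOMATION=True
"""

env = parse_env(sample)
print("Still to fill in:", unset_keys(env))
```

In a notebook you would read the text from the `.env` file at the repo root instead of the inline `sample` string. Keep the real `.env` out of version control, since it holds credentials.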

You can find more information on the various notebooks and their purpose [here](content.md).

If you need more help navigating the Operate First environment, we have a few [short videos](https://www.youtube.com/playlist?list=PL8VBRDTElCWpneB4dBu4u1kHElZVWfAwW) to help you get started.