
Merge pull request #72 from oindrillac/master
changes to project documentation for data sci website reorg
sesheta committed Aug 4, 2021
2 parents 3adb4bc + 17656ab commit de9cd8d
Showing 7 changed files with 100 additions and 87 deletions.
62 changes: 6 additions & 56 deletions README.md
@@ -1,66 +1,16 @@
# Mailing List Analysis Toolkit

This repository contains example code for how to develop a custom end-to-end email analytics service using the Open Data Hub on OpenShift. Although the specific example here is a mailing list analysis tool, our hope is to show that this approach could be easily modified and adapted to many intelligent application use cases.
This toolkit contains example code for how to develop a custom end-to-end email analytics service using the Open Data Hub on OpenShift. We demonstrate this by performing text analysis on the Fedora mailing list, which hosts monthly discussions about issues in Fedora development and suggestions for how to address them.

## Project Assumptions
This repository assumes that you have an existing Open Data Hub deployed on OpenShift that includes JupyterHub, Argo, Ceph, Hive, Cloudera Hue, and Apache Superset.
This project aims to help the Fedora community bring a more data-driven approach to their planning process by performing text analysis and gathering insights into the trends in the email conversations.

* Take a look at [opendatahub.io](https://www.opendatahub.io) for details on the open data hub platform.
* Details of our existing public deployment can be found at [operate-first.cloud](https://www.operate-first.cloud/).

## Project Overview
The purpose of this project is to develop and maintain an open source NLP application that provides regular and up-to-date analytics for large open source development project mailing lists.

### **Current Lists / Datasets**
* [Fedora Devel](https://lists.fedoraproject.org/archives/list/[email protected]/)
* ? (please open an issue if you'd like another mailing list included)

### **Application Overview**

At a high level, this application can be seen as an [Argo Workflow](https://argoproj.github.io/argo/) which orchestrates a set of Jupyter notebooks to push transformed data to Ceph. Each notebook is responsible for a single task and is used either for collecting raw data from the [Fedora HyperKitty mailing list archive](https://lists.fedoraproject.org/archives/) (our live data set), preprocessing that data, or performing some specific analytics task. In almost all cases, the notebooks both push their outputs to Ceph remote storage (for use in a future run) and maintain a local copy within a shared volume among the application's pods for use by other notebook processes. Finally, we use external tables in Apache Hive, with Hue, to connect the Ceph data to a SQL database for interactive visualization with Superset.

![](docs/assets/images/app-overview.png)


### **User Interaction**

The primary output and user interface for this application is a [Superset](https://superset.apache.org/) dashboard. This tool allows us to define certain data visualization elements from our analysis that we would like to publish and share with others, while also including enough flexibility and interactivity to allow users to explore the data themselves.

Our application is designed to automatically re-run the analyses on a regular basis and ensure that the dashboard and all its plots stay up to date.

* The current [Superset Dashboard](https://superset.datahub.redhat.com/superset/dashboard/fedora_mail/) can be found here.

![](docs/assets/images/fedora-dashboard.png)


### **Notebook Architecture**

Currently, notebooks are divided into two sub-directories, `notebooks/01_collect_data` and `notebooks/02_analyses`, depending on what stage in the Argo workflow they belong to. This should make it explicit where notebooks go in the Argo workflow dependency tree when defining it in the `wftmpl.yaml` manifest file. Ideally, the notebooks in `notebooks/01_collect_data` should not be dependent on each other (they could be run in parallel), and notebooks in `notebooks/02_analyses` should be independent of each other and only depend on the output of notebooks in `notebooks/01_collect_data`. That way we keep the workflow and dependency structure clear during development, and we believe this architecture can be easily modified to accommodate more complex dependency requirements.

**Current Notebooks**:

* 01_collect_data

* collect_data (Download new data from source and push to remote storage)
* download_dataset (Download existing preprocessed data from remote storage)
* gz_to_raw (Convert downloaded *.gz files to raw mbox format)
* raw_to_meta (Process mbox files into monthly metadata *.csv and push to remote storage)
* raw_to_text (Process mbox files into monthly email body *.csv and push to remote storage)
* ? (please open an issue if you would like an additional data collection or preprocessing step added)

* 02_analyses

* contributor_analysis (Quantify users' monthly activity and push to remote storage)
* keyword_analysis (Identify top keywords for each month and push to remote storage; a rough sketch of such an analysis follows this list)
* ? (please open an issue if you would like an additional analysis added)
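
As a rough, hypothetical illustration of what an analysis step like keyword_analysis might do (not the notebook's actual implementation), a monthly top-keyword pass over the email-body CSVs could look like the sketch below; the file paths and the `Body` column name are assumptions.

```python
# Hypothetical sketch of a monthly keyword pass; file paths and the "Body" column
# name are assumptions, not the notebook's actual code.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

def top_keywords_for_month(csv_path: str, n_keywords: int = 20) -> pd.DataFrame:
    """Return the highest-scoring TF-IDF terms for one month of email bodies."""
    bodies = pd.read_csv(csv_path)["Body"].dropna().astype(str)
    vectorizer = TfidfVectorizer(stop_words="english", max_features=5000)
    tfidf = vectorizer.fit_transform(bodies)
    scores = tfidf.sum(axis=0).A1                      # total weight of each term for the month
    terms = vectorizer.get_feature_names_out()
    ranked = sorted(zip(terms, scores), key=lambda pair: pair[1], reverse=True)
    return pd.DataFrame(ranked[:n_keywords], columns=["keyword", "score"])

# e.g. top_keywords_for_month("data/text/fedora-devel-2021-07.csv").to_csv(
#     "data/keywords/fedora-devel-2021-07.csv", index=False)
```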

**Adding Notebooks**
Although the specific example here is a mailing list analysis tool, our hope is to show that this approach could be easily modified and adapted to many intelligent application use cases.

One of the key benefits of this approach is that it allows the bulk of the development to be done directly in a Jupyter notebook and makes adding new analyses or preprocessing steps as simple as adding a new notebook to the repository. For example, to add an additional analysis to the application, one just needs to submit a PR with a new notebook in `notebooks/02_analyses` and a small update to `manifest/wftmpl.yaml` to include the new notebook in the workflow.

* **[Get Started](docs/getting-started.md)**

### Automation and Workflow Configurations
* **[How to Contribute](docs/contribute.md)**

Please see the README at [manifests/README.md](manifests/README.md) for complete details on how to define the automation and application workflow via Argo.
* **[Project Content](docs/content.md)**
39 changes: 39 additions & 0 deletions docs/content.md
@@ -0,0 +1,39 @@
# Content

The project repository for the Mailing List Analysis Toolkit contains example code for how to develop a custom end-to-end email analytics service using the Open Data Hub on OpenShift.

Here's a [video](https://www.youtube.com/watch?v=arvpVoTXgZg) which goes over the project and demonstrates the automated dashboard.

## Current Lists / Datasets

* [Fedora Devel](https://lists.fedoraproject.org/archives/list/[email protected]/)
* ? (please [open an issue](https://github.com/aicoe-aiops/mailing-list-analysis-toolkit/issues/new?assignees=&labels=enhancement&template=feature_request.md) if you'd like another mailing list included)

## Application Overview

At a high level, this application can be seen as an [Argo Workflow](https://argoproj.github.io/argo/) which orchestrates a set of Jupyter notebooks to push transformed data to Ceph. Each notebook is responsible for a single task and is used either for collecting raw data from the [Fedora HyperKitty mailing list archive](https://lists.fedoraproject.org/archives/) (our live data set), preprocessing that data, or performing some specific analytics task. In almost all cases, the notebooks both push their outputs to Ceph remote storage (for use in a future run) and maintain a local copy within a shared volume among the application's pods for use by other notebook processes. Finally, we use external tables in Apache Hive, with Hue, to connect the Ceph data to a SQL database for interactive visualization with Superset.

![](assets/images/app-overview.png)
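
As a minimal sketch of the local-plus-remote output pattern described above (the endpoint variables, bucket name, and key layout are assumptions, not the project's actual configuration), a notebook's push to Ceph's S3-compatible storage might look roughly like this:

```python
# Illustrative sketch of the "write locally, push to Ceph" pattern used by the notebooks.
# The endpoint, credentials, bucket, and key layout are assumptions, not the real config.
import os
import boto3
import pandas as pd

s3 = boto3.client(
    "s3",
    endpoint_url=os.environ["S3_ENDPOINT_URL"],        # Ceph RADOS Gateway (S3-compatible)
    aws_access_key_id=os.environ["S3_ACCESS_KEY"],
    aws_secret_access_key=os.environ["S3_SECRET_KEY"],
)

def save_and_push(df: pd.DataFrame, local_path: str, bucket: str, key: str) -> None:
    """Keep a copy on the shared volume for downstream notebooks and push it to Ceph."""
    os.makedirs(os.path.dirname(local_path), exist_ok=True)
    df.to_csv(local_path, index=False)       # local copy for other pods in the workflow
    s3.upload_file(local_path, bucket, key)  # remote copy reused by future runs

# e.g. save_and_push(metadata_df, "data/meta/2021-07.csv", "mailing-list-toolkit", "meta/2021-07.csv")
```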

Here is a [guide](../manifests/README.md) which outlines the steps needed to automate your Jupyter notebooks using Argo. By following the steps in the document, your application can be fully set up and ready to be deployed via Argo CD.

## Notebooks

Currently, notebooks are divided into two sub-directories, `notebooks/01_collect_data` and `notebooks/02_analyses`, depending on what stage in the Argo workflow they belong to. This should make it explicit where notebooks go in the Argo workflow dependency tree when defining it in the `wftmpl.yaml` manifest file. Ideally, the notebooks in `notebooks/01_collect_data` should not be dependent on each other (they could be run in parallel), and notebooks in `notebooks/02_analyses` should be independent of each other and only depend on the output of notebooks in `notebooks/01_collect_data`. That way we keep the workflow and dependency structure clear during development, and we believe this architecture can be easily modified to accommodate more complex dependency requirements.


* **01_collect_data**

* [collect_data](../notebooks/01_collect_data/collect_data.ipynb) - Download new data from source and push to remote storage
* [download_dataset](../notebooks/01_collect_data/download_datasets.ipynb) - Download existing preprocessed data from remote storage
* [gz_to_raw](../notebooks/01_collect_data/gz_to_raw.ipynb) - Convert downloaded *.gz files to raw mbox format
* [raw_to_meta](../notebooks/01_collect_data/raw_to_meta.ipynb) - Process mbox files into monthly metadata *.csv and push to remote storage (a rough sketch of this step follows the list below)
* [raw_to_text](../notebooks/01_collect_data/raw_to_text.ipynb) - Process mbox files into monthly email body *.csv and push to remote storage
* ? (please [open an issue](https://github.com/aicoe-aiops/mailing-list-analysis-toolkit/issues/new?assignees=&labels=enhancement&template=feature_request.md) if you would like an additional data collection or preprocessing step added)

* **02_analyses**

* [contributor_analysis](../notebooks/02_analyses/contributor_analysis.ipynb) - Quantify users' monthly activity and push to remote storage
* [keyword_analysis](../notebooks/02_analyses/keyword_analysis.ipynb) - Identify top keywords for each month and push to remote storage
* [sentiment_analysis](../notebooks/02_analyses/sentiment_analysis.ipynb) - Sentiment analysis on the body of emails
* ? (please [open an issue](https://github.com/aicoe-aiops/mailing-list-analysis-toolkit/issues/new?assignees=&labels=enhancement&template=feature_request.md) if you would like an additional analysis added)
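
As a hypothetical sketch of the raw_to_meta step referenced above (the extracted fields and paths are assumptions, not the notebook's actual code), converting an mbox archive into a monthly metadata CSV could be done with Python's standard `mailbox` module:

```python
# Hypothetical sketch of turning an mbox archive into a monthly metadata CSV;
# the extracted fields and paths are assumptions, not the notebook's actual code.
import mailbox
import pandas as pd

def mbox_to_metadata(mbox_path: str, csv_path: str) -> pd.DataFrame:
    """Extract basic per-message metadata from an mbox file and write it to CSV."""
    rows = []
    for message in mailbox.mbox(mbox_path):
        rows.append(
            {
                "Message-ID": message.get("Message-ID"),
                "Date": message.get("Date"),
                "From": message.get("From"),
                "Subject": message.get("Subject"),
                "In-Reply-To": message.get("In-Reply-To"),
            }
        )
    df = pd.DataFrame(rows)
    df.to_csv(csv_path, index=False)
    return df

# e.g. mbox_to_metadata("data/raw/fedora-devel-2021-07.mbox", "data/meta/2021-07.csv")
```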
15 changes: 15 additions & 0 deletions docs/contribute.md
@@ -0,0 +1,15 @@
# Contribute to the Mailing List Analysis Toolkit

To familiarize yourself with the Mailing List Analysis project, check out the [Get Started](getting-started.md) guide.

## Add an Analysis

One of the key benefits of this approach is that it allows the bulk of the development to be done directly in a Jupyter notebook and makes adding new analyses or preprocessing steps as simple as adding a new notebook to the repository.

For example, to add an additional analysis to the application, one just needs to [submit a PR](https://github.com/aicoe-aiops/mailing-list-analysis-toolkit/pulls) with a new notebook in `notebooks/02_analyses` and a small update to `manifest/wftmpl.yaml` to include the new notebook in the workflow.
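
As a loose, hypothetical sketch of what such a new analysis notebook might contain (the directory layout, file names, and example metric are assumptions), it typically only needs to read the preprocessed monthly CSVs from the shared volume, compute its result, and write a CSV that is then pushed to remote storage in the usual way:

```python
# Hypothetical skeleton for a new analysis notebook; the directory layout and the
# example metric are assumptions, not part of the project's actual code.
import glob
import os
import pandas as pd

frames = []
for path in sorted(glob.glob("data/text/*.csv")):   # monthly email-body CSVs from 01_collect_data
    month_df = pd.read_csv(path)
    month_df["month"] = os.path.splitext(os.path.basename(path))[0]
    frames.append(month_df)
emails = pd.concat(frames, ignore_index=True)

# Example metric: number of emails per month (replace with the actual analysis).
result = emails.groupby("month").size().rename("email_count").reset_index()
result.to_csv("data/new_analysis/email_counts.csv", index=False)  # then push to Ceph as usual
```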

You can also reach out to [email protected] for any questions.

## Automation and Workflow Configurations

Please see the README at [manifests/README.md](../manifests/README.md) for complete details on how to define the automation and application workflow via Argo. By following the guide, you can automate your application and Jupyter notebooks using Argo CD.
39 changes: 39 additions & 0 deletions docs/getting-started.md
@@ -0,0 +1,39 @@
# Get Started

This project contains examples of how to perform end-to-end analysis on mailing lists. See the [project overview](../README.md) for more details about the project.

## Run the Notebooks

There are interactive notebooks for this [project](https://github.com/aicoe-aiops/mailing-list-analysis-toolkit) available for anyone to start using on the public [JupyterHub](https://jupyterhub-opf-jupyterhub.apps.zero.massopen.cloud/hub/login) instance on the [Massachusetts Open Cloud](https://massopen.cloud/) (MOC) right now!

1. To get started, access [JupyterHub](https://jupyterhub-opf-jupyterhub.apps.zero.massopen.cloud/), select log in with `moc-sso` and sign in using your Google Account.
2. After signing in, on the spawner page, select the `Mailing list analysis toolkit` image from the JupyterHub Notebook Image dropdown, choose a `Medium` container size, and hit `Start` to start your server.
3. Once your server has spawned, you should see a directory titled `mailing-list-analysis-toolkit-<current-timestamp>`.
4. Go to `notebooks/`. To look into the data collection and pre-processing steps, explore the notebooks in the `01_collect_data` directory; to see examples of analyses on mailing lists, go through the notebooks in the `02_analyses` directory.
5. Here's a video that can help familiarize you with the project.

`video: https://www.youtube.com/watch?v=arvpVoTXgZg`

If you need more help navigating the Operate First environment, we have a few [short videos](https://www.youtube.com/playlist?list=PL8VBRDTElCWpneB4dBu4u1kHElZVWfAwW) to help you get started.


## Project Assumptions

This repository assumes that you have an existing Open Data Hub deployed on OpenShift that includes JupyterHub, Argo, Ceph, Hive, Cloudera Hue, and Apache Superset.

* Take a look at [opendatahub.io](https://www.opendatahub.io) for details on the open data hub platform.
* Details of our existing public deployment can be found at [operate-first.cloud](https://www.operate-first.cloud/).

### Dashboard

The primary output and user interface for this application is a [Superset](https://superset.apache.org/) dashboard. This tool allows us to define certain data visualization elements from our analysis that we would like to publish and share with others, while also including enough flexibility and interactivity to allow users to explore the data themselves.

Our application is designed to automatically re-run the analyses on a regular basis and ensure that the dashboard and all its plots stay up to date.

* The current [Superset Dashboard](https://superset.datahub.redhat.com/superset/dashboard/fedora_mail/) can be found here. It is currently accessible internally at Red Hat; however, there are plans to make the analysis publicly accessible on Operate First (see this [issue](https://github.com/aicoe-aiops/mailing-list-analysis-toolkit/issues/67)).

![](assets/images/fedora-dashboard.png)

### Automated Argo workflows

If you'd like to automate your Jupyter notebooks using Argo, please follow the steps outlined in this [guide](../manifests/README.md). By following the steps in the document, your application can be fully set up and ready to be deployed via Argo CD.
6 changes: 0 additions & 6 deletions docs/getting-started.rst

This file was deleted.

24 changes: 0 additions & 24 deletions docs/index.rst

This file was deleted.

2 changes: 1 addition & 1 deletion manifests/README.md
@@ -1,6 +1,6 @@
# Automated Argo workflows

If you'd like to automate your Jupyter notebooks using Argo, please use these kustomize manifests. If you follow the steps bellow, your application is fully set and ready to be deployed via Argo CD.
If you'd like to automate your Jupyter notebooks using Argo, please use these kustomize manifests. If you follow the steps below, your application is fully set up and ready to be deployed via Argo CD.

For a detailed guide on how to adjust your notebooks etc., please consult the [documentation](https://github.com/aicoe-aiops/data-science-workflows/blob/master/Automating%20via%20Argo.md)

