generated from aicoe-aiops/project-template
Merge pull request #72 from oindrillac/master
changes to project documentation for data sci website reorg
Showing 7 changed files with 100 additions and 87 deletions.
@@ -1,66 +1,16 @@
# Mailing List Analysis Toolkit

This repository contains example code for how to develop a custom end-to-end email analytics service using the Open Data Hub on OpenShift. Although the specific example here is a mailing list analysis tool, our hope is to show that this approach could be easily modified and adapted to many intelligent application use cases.
This toolkit contains example code for how to develop a custom end-to-end email analytics service using the Open Data Hub on OpenShift. We demonstrate this by performing text analysis on the Fedora mailing list. This list contains many discussions about the issues occurring with Fedora development on a monthly basis and suggestions for how to address them.

#
## Project Assumptions
This repository assumes that you have an existing Open Data Hub deployed on OpenShift that includes JupyterHub, Argo, Ceph, Hive, Cloudera Hue, and Apache Superset.
This project aims to help the Fedora community bring a more data-driven approach to their planning process by performing text analysis and gathering insights into the trends in the email conversations.

* Take a look at [opendatahub.io](https://www.opendatahub.io) for details on the Open Data Hub platform.
* Details of our existing public deployment can be found at [operate-first.cloud](https://www.operate-first.cloud/).

#
## Project Overview
The purpose of this project is to develop and maintain an open source NLP application that provides regular and up-to-date analytics for large open source development project mailing lists.

### **Current Lists / Datasets**
* [Fedora Devel](https://lists.fedoraproject.org/archives/list/[email protected]/)
* ? (please open an issue if you'd like another mailing list included)

### **Application Overview**

At a high level, this application can be seen as an [Argo Workflow](https://argoproj.github.io/argo/) which orchestrates a set of Jupyter notebooks to push transformed data to Ceph. Each notebook is responsible for a single task and is used either for collecting raw data from the [Fedora HyperKitty mailing list archive](https://lists.fedoraproject.org/archives/) (our live data set), preprocessing that data, or performing some specific analytics task. In almost all cases, the notebooks both push their outputs to Ceph remote storage (for use in a future run) and maintain a local copy within a shared volume among the application's pods for use by other notebook processes. Finally, we use external tables in Apache Hive, with Hue, to connect the Ceph data to a SQL database for interactive visualization with Superset.
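The dual-write pattern described above (push outputs to Ceph for future runs, keep a local copy on the shared volume for downstream notebooks) can be sketched roughly as follows. This is a minimal illustration, not the project's actual helper code: the directory, key prefix, and the injected `upload` callable are hypothetical, and a real notebook would use an S3-compatible client against the Ceph endpoint.

```python
from pathlib import Path

def publish_output(csv_text, filename, local_dir, upload):
    """Write a notebook's output to the shared local volume (for downstream
    notebooks in the same workflow run) and push it to remote storage
    (for reuse in a future run). `upload(local_path, remote_key)` stands in
    for a real S3/Ceph client call."""
    local_dir = Path(local_dir)
    local_dir.mkdir(parents=True, exist_ok=True)
    local_path = local_dir / filename
    local_path.write_text(csv_text)                  # local copy for sibling notebooks
    upload(str(local_path), f"interim/{filename}")   # remote copy for future runs
    return local_path

# Stand-in uploader that just records the remote keys it was given:
uploaded = []
path = publish_output("date,count\n2020-01,42\n", "counts.csv",
                      "/tmp/shared-volume", lambda p, k: uploaded.append(k))
```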
![](docs/assets/images/app-overview.png)

### **User Interaction**

The primary output and user interface for this application is a [Superset](https://superset.apache.org/) dashboard. This tool allows us to define certain data visualization elements from our analysis that we would like to publish and share with others, while also including enough flexibility and interactivity to allow users to explore the data themselves.

Our application is designed to automatically re-run the analyses on a regular basis and ensure that the dashboard and all its plots are current and up to date.

* The current Superset dashboard can be found [here](https://superset.datahub.redhat.com/superset/dashboard/fedora_mail/).

![](docs/assets/images/fedora-dashboard.png)

### **Notebook Architecture**

Currently notebooks are divided into two sub-directories, `notebooks/01_collect_data` and `notebooks/02_analyses`, depending on what stage in the Argo workflow they belong to. This should make it explicit where notebooks go in the Argo workflow dependency tree when defining it in the `wftmpl.yaml` manifest file. Ideally, the notebooks in `notebooks/01_collect_data` should not be dependent on each other (they could be run in parallel), and the notebooks in `notebooks/02_analyses` should be independent of each other and depend only on the output of notebooks in `notebooks/01_collect_data`. That way we keep the workflow and dependency structure clear during development, and we believe this architecture can be easily modified to accommodate more complex dependency requirements.
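The two-stage convention just described implies a very simple execution order, which can be sketched as follows. This is an illustrative model only, not code from the repository; the grouping into concurrent batches is what the `wftmpl.yaml` dependency tree expresses.

```python
# Stage 1 notebooks are mutually independent (so they could run in parallel);
# stage 2 notebooks depend only on stage 1 outputs.
stages = {
    "01_collect_data": ["collect_data", "download_dataset", "gz_to_raw",
                        "raw_to_meta", "raw_to_text"],
    "02_analyses": ["contributor_analysis", "keyword_analysis"],
}

def execution_batches(stages):
    """Return notebooks grouped into batches; notebooks within a batch may
    run concurrently, and batches run in the order given."""
    return [list(notebooks) for notebooks in stages.values()]

batches = execution_batches(stages)
```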
**Current Notebooks**:

* 01_collect_data

    * collect_data (Download new data from source and push to remote storage)
    * download_dataset (Download existing preprocessed data from remote storage)
    * gz_to_raw (Convert downloaded *.gz files to raw mbox format)
    * raw_to_meta (Process mbox files into monthly metadata *.csv and push to remote storage)
    * raw_to_text (Process mbox files into monthly email body *.csv and push to remote storage)
    * ? (please open an issue if you would like an additional data collection or preprocessing step added)

* 02_analyses

    * contributor_analysis (Quantify users' monthly activity and push to remote storage)
    * keyword_analysis (Identify top keywords for each month and push to remote storage)
    * ? (please open an issue if you would like an additional analysis added)
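As a rough, standard-library-only illustration of what a step like keyword_analysis computes, the sketch below counts the most common words per month. The tokenization and stopword list are deliberately naive stand-ins; the actual notebook is the reference implementation.

```python
import re
from collections import Counter

STOPWORDS = frozenset({"the", "a", "an", "to", "and", "of", "in", "is"})

def top_keywords_by_month(emails, n=3):
    """emails: iterable of (month, body) pairs, e.g. ("2020-01", "text ...").
    Returns {month: [(word, count), ...]} with each month's n most common words."""
    counts = {}
    for month, body in emails:
        words = [w for w in re.findall(r"[a-z']+", body.lower())
                 if w not in STOPWORDS]
        counts.setdefault(month, Counter()).update(words)
    return {month: c.most_common(n) for month, c in counts.items()}

result = top_keywords_by_month([
    ("2020-01", "The build failed again. Build logs attached."),
    ("2020-01", "New build of the kernel package."),
], n=1)
```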
**Adding Notebooks**
Although the specific example here is a mailing list analysis tool, our hope is to show that this approach could be easily modified and adapted to many intelligent application use cases.

One of the key benefits of this approach is that it allows for the bulk of the development to be done directly in a Jupyter notebook, as well as making adding new analyses or preprocessing steps as simple as adding a new notebook to the repository. For example, in order to add an additional analysis to the application, one just needs to submit a PR with a new notebook in `notebooks/02_analyses` and a small update to `manifest/wftmpl.yaml` to include the new notebook in the workflow.

* **[Get Started](docs/getting-started.md)**

### Automation and Workflow Configurations
* **[How to Contribute](docs/contribute.md)**

Please see the README at [manifests/README.md](manifests/README.md) for complete details on how to define the automation and application workflow via Argo.
* **[Project Content](docs/content.md)**
@@ -0,0 +1,39 @@
# Content

The project repository for the mailing list analysis toolkit contains example code for how to develop a custom end-to-end email analytics service using the Open Data Hub on OpenShift.

Here's a [video](https://www.youtube.com/watch?v=arvpVoTXgZg) which goes over the project and demonstrates the automated dashboard.

## Current Lists / Datasets

* [Fedora Devel](https://lists.fedoraproject.org/archives/list/[email protected]/)
* ? (please [open an issue](https://github.com/aicoe-aiops/mailing-list-analysis-toolkit/issues/new?assignees=&labels=enhancement&template=feature_request.md) if you'd like another mailing list included)

## Application Overview

At a high level, this application can be seen as an [Argo Workflow](https://argoproj.github.io/argo/) which orchestrates a set of Jupyter notebooks to push transformed data to Ceph. Each notebook is responsible for a single task and is used either for collecting raw data from the [Fedora HyperKitty mailing list archive](https://lists.fedoraproject.org/archives/) (our live data set), preprocessing that data, or performing some specific analytics task. In almost all cases, the notebooks both push their outputs to Ceph remote storage (for use in a future run) and maintain a local copy within a shared volume among the application's pods for use by other notebook processes. Finally, we use external tables in Apache Hive, with Hue, to connect the Ceph data to a SQL database for interactive visualization with Superset.

![](docs/assets/images/app-overview.png)

Here is a [guide](../manifests/README.md) which outlines the steps needed to automate your Jupyter notebooks using Argo. By following the steps in the document, your application can be fully set up and ready to be deployed via Argo CD.

## Notebooks

Currently notebooks are divided into two sub-directories, `notebooks/01_collect_data` and `notebooks/02_analyses`, depending on what stage in the Argo workflow they belong to. This should make it explicit where notebooks go in the Argo workflow dependency tree when defining it in the `wftmpl.yaml` manifest file. Ideally, the notebooks in `notebooks/01_collect_data` should not be dependent on each other (they could be run in parallel), and the notebooks in `notebooks/02_analyses` should be independent of each other and depend only on the output of notebooks in `notebooks/01_collect_data`. That way we keep the workflow and dependency structure clear during development, and we believe this architecture can be easily modified to accommodate more complex dependency requirements.

* **01_collect_data**

    * [collect_data](../notebooks/01_collect_data/collect_data.ipynb) - Download new data from source and push to remote storage
    * [download_dataset](../notebooks/01_collect_data/download_datasets.ipynb) - Download existing preprocessed data from remote storage
    * [gz_to_raw](../notebooks/01_collect_data/gz_to_raw.ipynb) - Convert downloaded *.gz files to raw mbox format
    * [raw_to_meta](../notebooks/01_collect_data/raw_to_meta.ipynb) - Process mbox files into monthly metadata *.csv and push to remote storage
    * [raw_to_text](../notebooks/01_collect_data/raw_to_text.ipynb) - Process mbox files into monthly email body *.csv and push to remote storage
    * ? (please [open an issue](https://github.com/aicoe-aiops/mailing-list-analysis-toolkit/issues/new?assignees=&labels=enhancement&template=feature_request.md) if you would like an additional data collection or preprocessing step added)

* **02_analyses**

    * [contributor_analysis](../notebooks/02_analyses/contributor_analysis.ipynb) (Quantify users' monthly activity and push to remote storage)
    * [keyword_analysis](../notebooks/02_analyses/keyword_analysis.ipynb) (Identify top keywords for each month and push to remote storage)
    * [sentiment_analysis](../notebooks/02_analyses/sentiment_analysis.ipynb) (Sentiment analysis on the body of emails)
    * ? (please [open an issue](https://github.com/aicoe-aiops/mailing-list-analysis-toolkit/issues/new?assignees=&labels=enhancement&template=feature_request.md) if you would like an additional analysis added)
@@ -0,0 +1,15 @@
# Contribute to the Mailing List Analysis Toolkit

To familiarize yourself with the mailing list analysis project, see how to [Get Started](getting-started.md).

## Add an Analysis

One of the key benefits of this approach is that it allows for the bulk of the development to be done directly in a Jupyter notebook, as well as making adding new analyses or preprocessing steps as simple as adding a new notebook to the repository.

For example, in order to add an additional analysis to the application, one just needs to [submit a PR](https://github.com/aicoe-aiops/mailing-list-analysis-toolkit/pulls) with a new notebook in `notebooks/02_analyses` and a small update to `manifest/wftmpl.yaml` to include the new notebook in the workflow.
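The `wftmpl.yaml` update essentially registers the new notebook as a workflow step that runs after the data-collection stage. Purely as an illustration of the shape of that change, and not the manifest's actual schema (the field names below are hypothetical), the mechanics can be mimicked in Python:

```python
def add_analysis_step(workflow, notebook):
    """Append a step for a notebook in notebooks/02_analyses that runs
    after the data-collection stage. `workflow` mimics a parsed manifest;
    the field names are illustrative only."""
    step = {
        "name": notebook.replace("_", "-"),
        "notebook": f"notebooks/02_analyses/{notebook}.ipynb",
        "dependencies": ["collect-data"],  # stage 2 runs after stage 1
    }
    workflow.setdefault("steps", []).append(step)
    return workflow

wf = add_analysis_step({"steps": [{"name": "collect-data"}]}, "sentiment_analysis")
```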
You can also reach out to [email protected] for any questions.

## Automation and Workflow Configurations

Please see the README at [manifests/README.md](../manifests/README.md) for complete details on how to define the automation and application workflow via Argo. By following the guide, you can automate your application and Jupyter notebooks using Argo CD.
@@ -0,0 +1,39 @@
# Get Started

This project contains examples of how to perform end-to-end analysis on mailing lists. See the [project overview](../README.md) for a summary of the project.

## Run the Notebooks

There are interactive notebooks for this [project](https://github.com/aicoe-aiops/mailing-list-analysis-toolkit) available for anyone to start using on the public [JupyterHub](https://jupyterhub-opf-jupyterhub.apps.zero.massopen.cloud/hub/login) instance on the [Massachusetts Open Cloud](https://massopen.cloud/) (MOC) right now!

1. To get started, access [JupyterHub](https://jupyterhub-opf-jupyterhub.apps.zero.massopen.cloud/), select log in with `moc-sso`, and sign in using your Google account.
2. After signing in, on the spawner page, select the `Mailing list analysis toolkit` image from the dropdown in the JupyterHub Notebook Image section, select a `Medium` container size, and hit `Start` to start your server.
3. Once your server has spawned, you should see a directory titled `mailing-list-analysis-toolkit-<current-timestamp>`.
4. Go to `notebooks/`. To look into the data collection and pre-processing steps, explore the notebooks in the `01_collect_data` directory; to explore examples of analyses on mailing lists, go through the notebooks in the `02_analyses` directory.
5. Here's a video that can help familiarize you with the project.

`video: https://www.youtube.com/watch?v=arvpVoTXgZg`

If you need more help navigating the Operate First environment, we have a few [short videos](https://www.youtube.com/playlist?list=PL8VBRDTElCWpneB4dBu4u1kHElZVWfAwW) to help you get started.

## Project Assumptions

This repository assumes that you have an existing Open Data Hub deployed on OpenShift that includes JupyterHub, Argo, Ceph, Hive, Cloudera Hue, and Apache Superset.

* Take a look at [opendatahub.io](https://www.opendatahub.io) for details on the Open Data Hub platform.
* Details of our existing public deployment can be found at [operate-first.cloud](https://www.operate-first.cloud/).

### Dashboard

The primary output and user interface for this application is a [Superset](https://superset.apache.org/) dashboard. This tool allows us to define certain data visualization elements from our analysis that we would like to publish and share with others, while also including enough flexibility and interactivity to allow users to explore the data themselves.

Our application is designed to automatically re-run the analyses on a regular basis and ensure that the dashboard and all its plots are current and up to date.

* The current [Superset Dashboard](https://superset.datahub.redhat.com/superset/dashboard/fedora_mail/) can be found here. It is currently accessible internally at Red Hat; however, there are also plans to make the analysis publicly accessible on Operate First (see [issue](https://github.com/aicoe-aiops/mailing-list-analysis-toolkit/issues/67)).

![](assets/images/fedora-dashboard.png)

### Automated Argo workflows

If you'd like to automate your Jupyter notebooks using Argo, please follow the steps outlined in this [guide](../manifests/README.md). By following the steps in the document, your application can be fully set up and ready to be deployed via Argo CD.
This file was deleted.
This file was deleted.