From 17656ab5730ff500e9bb30cb82c22b2b23dc5303 Mon Sep 17 00:00:00 2001 From: Oindrilla Chatterjee Date: Wed, 4 Aug 2021 15:09:23 -0400 Subject: [PATCH] changes to project document for website reorg --- README.md | 62 ++++------------------------------ docs/content.md | 39 +++++++++++++++++++++++++ docs/contribute.md | 15 ++++++++++ docs/getting-started.md | 39 +++++++++++++++++++++++++ docs/getting-started.rst | 6 ---- docs/index.rst | 24 ---------------- manifests/README.md | 2 +- 7 files changed, 100 insertions(+), 87 deletions(-) create mode 100644 docs/content.md create mode 100644 docs/contribute.md create mode 100644 docs/getting-started.md delete mode 100644 docs/getting-started.rst delete mode 100644 docs/index.rst diff --git a/README.md b/README.md index a29d243..7407762 100644 --- a/README.md +++ b/README.md @@ -1,66 +1,16 @@ # Mailing List Analysis Toolkit -This repository contains example code for how to develop a custom end-to-end email analytics service using the Open Data Hub on OpenShift. Although the specific example here is a mailing list analysis tool, our hope is to show that this approach could be easily modified and adapted to many intelligent application use cases. +This toolkit contains example code for how to develop a custom end-to-end email analytics service using the Open Data Hub on OpenShift. We demonstrate this by performing text analysis on the Fedora mailing list. This list contains many monthly discussions about issues occurring in Fedora development and suggestions for how to address them. -# -## Project Assumptions -This repository assumes that you have an existing Open Data Hub deployed on OpenShift that includes, JupyterHub, Argo, Ceph, Hive, Cloudera Hue and Apache Superset. +This project aims to help the Fedora community bring a more data-driven approach to its planning process by performing text analysis and gathering insights into trends in the email conversations. -* Take a look at [opendatahub.io](https://www.opendatahub.io) for details on the open data hub platform. -* Details of our existing public deployment can be found at [operate-first.cloud](https://www.operate-first.cloud/). - -# -## Project Overview The purpose of this project is to develop and maintain an open source NLP application that provides regular and up-to-date analytics for large open source development project mailing lists. -### **Current Lists/ Datasets** -* [Fedora Devel](https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org/) -* ? (please open an issue if you'd like another mailing list included) - -### **Application Overview** - -At a high level, this application can be seen as an [Argo Workflow](https://argoproj.github.io/argo/) which orchestrates a set of Jupyter notebooks to push transformed data to Ceph. Each notebook is responsible for a single task and is used either for collecting raw data from the [Fedora HyperKitty mailing list archive](https://lists.fedoraproject.org/archives/) (our live data set), preprocessing that data, or performing some specific analytics task. In almost all cases, the notebooks both push their outputs to Ceph remote storage (for use in a future run) as well as maintain a local copy within a shared volume among the application's pods for use by other notebook processes. Finally we use external tables in Apache Hive, with Hue, to connect the Ceph data to a SQL database for interactive visualization with Superset. 
- -![](docs/assets/images/app-overview.png) - - -### **User Interaction** - -The primary output and user interface for this application is a [Superset](https://superset.apache.org/) dashboard. This tool allows us to define certain data visualization elements from our analysis that we would like to publish and share with others, while also including enough flexibility and interactivity to allow users to explore the data themselves. - -Our application is designed to automatically re-run the analyses on regular basis and ensure that the dashboard and all its plots are current and up to date. - -* Current [Superset Dashboard](https://superset.datahub.redhat.com/superset/dashboard/fedora_mail/) can be found here. - -![](docs/assets/images/fedora-dashboard.png) - - -### **Notebook Architecture** - -Currently notebooks are divided into two sub-directories `notebooks/01_collect_data` and `notebooks/02_analyses` depending on what stage in the argo workflow they belong to. This should make it explicit where notebooks go in the argo workflow dependency tree when defining it in the `wftmpl.yaml` manifest file. Ideally, the notebooks in `notebooks/01_collect_data` should not be dependent on each other (they could be run in parallel) and notebooks in `notebooks/02_analyses` should be independent of each other and only depend on the output of notebooks in `notebooks/01_collect_date`. That way we keep the workflow and dependency structure clear during development and we believe this architecture can be easily modified to accommodate more complex dependency requirements. - -**Current Notebooks**: - -* 01_collect_data - - * collect_data (Download new data from source and push to remote storage) - * download_dataset (Download existing preprocessed data from remote storage) - * gz_to_raw (Convert downloaded *.gz files to raw mbox format) - * raw_to_meta (Process mbox files into monthly metadata *.csv and push to remote storage) - * raw_to_text(Process mbox files into monthly email body *.csv and push to remote storage) - * ? (please open an issue if you would like an additional data collection or pre processing step added) - - * 02_analyses - - * contributor_analysis (Quantify users monthly activity and push to remote storage) - * keyword_analysis (Identify top Keywords for each month and push to remote storage) - * ? (please open an issue if you would like an additional analysis added) - -**Adding Notebooks** +Although the specific example here is a mailing list analysis tool, our hope is to show that this approach could be easily modified and adapted to many intelligent application use cases. -One of the key benefits of this approach is that it allows for the bulk of the development to be done directly in a jupyter notebook as well as making adding new analyses or preprocessing steps as simple adding a new notebook to the repository. For example, in order to add an additional analysis to the application one just needs to make submit a PR with a new notebook in `notebooks/02_analyses` and a small update to `manifest/wftmpl.yaml` to include the new notebook into the workflow. +* **[Get Started](docs/getting-started.md)** -### Automation and Workflow Configurations +* **[How to Contribute](docs/contribute.md)** -Please see the README at [manifests/README.md](manifests/README.md) for complete details on how to define the automation and application workflow via Argo. 
+ +* **[Project Content](docs/content.md)** diff --git a/docs/content.md b/docs/content.md new file mode 100644 index 0000000..7e955e2 --- /dev/null +++ b/docs/content.md @@ -0,0 +1,39 @@ +# Content + +The project repository for the mailing list analysis toolkit contains example code for how to develop a custom end-to-end email analytics service using the Open Data Hub on OpenShift. + +Here's a [video](https://www.youtube.com/watch?v=arvpVoTXgZg) that goes over the project and demonstrates the automated dashboard. + +## Current Lists/Datasets + +* [Fedora Devel](https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org/) +* ? (please [open an issue](https://github.com/aicoe-aiops/mailing-list-analysis-toolkit/issues/new?assignees=&labels=enhancement&template=feature_request.md) if you'd like another mailing list included) + +## Application Overview + +At a high level, this application can be seen as an [Argo Workflow](https://argoproj.github.io/argo/) which orchestrates a set of Jupyter notebooks to push transformed data to Ceph. Each notebook is responsible for a single task and is used either for collecting raw data from the [Fedora HyperKitty mailing list archive](https://lists.fedoraproject.org/archives/) (our live data set), preprocessing that data, or performing some specific analytics task. In almost all cases, the notebooks both push their outputs to Ceph remote storage (for use in a future run) as well as maintain a local copy within a shared volume among the application's pods for use by other notebook processes. Finally, we use external tables in Apache Hive, with Hue, to connect the Ceph data to a SQL database for interactive visualization with Superset. + +![](assets/images/app-overview.png) + +Here is a [guide](../manifests/README.md) that outlines the steps needed to automate your Jupyter notebooks using Argo. By following the steps in the document, your application can be fully set up and ready to be deployed via Argo CD. + +## Notebooks + +Currently, notebooks are divided into two sub-directories, `notebooks/01_collect_data` and `notebooks/02_analyses`, depending on what stage in the Argo workflow they belong to. This should make it explicit where notebooks go in the Argo workflow dependency tree when defining it in the `wftmpl.yaml` manifest file. Ideally, the notebooks in `notebooks/01_collect_data` should not be dependent on each other (they could be run in parallel), and notebooks in `notebooks/02_analyses` should be independent of each other and only depend on the output of notebooks in `notebooks/01_collect_data`. That way we keep the workflow and dependency structure clear during development, and we believe this architecture can be easily modified to accommodate more complex dependency requirements. + + +* **01_collect_data** + + * [collect_data](../notebooks/01_collect_data/collect_data.ipynb) - Download new data from source and push to remote storage + * [download_dataset](../notebooks/01_collect_data/download_datasets.ipynb) - Download existing preprocessed data from remote storage + * [gz_to_raw](../notebooks/01_collect_data/gz_to_raw.ipynb) - Convert downloaded *.gz files to raw mbox format + * [raw_to_meta](../notebooks/01_collect_data/raw_to_meta.ipynb) - Process mbox files into monthly metadata *.csv and push to remote storage + * [raw_to_text](../notebooks/01_collect_data/raw_to_text.ipynb) - Process mbox files into monthly email body *.csv and push to remote storage + * ? 
(please [open an issue](https://github.com/aicoe-aiops/mailing-list-analysis-toolkit/issues/new?assignees=&labels=enhancement&template=feature_request.md) if you would like an additional data collection or preprocessing step added) + + * **02_analyses** + + * [contributor_analysis](../notebooks/02_analyses/contributor_analysis.ipynb) (Quantify users' monthly activity and push to remote storage) + * [keyword_analysis](../notebooks/02_analyses/keyword_analysis.ipynb) (Identify top keywords for each month and push to remote storage) + * [sentiment_analysis](../notebooks/02_analyses/sentiment_analysis.ipynb) (Sentiment analysis on the body of emails) + * ? (please [open an issue](https://github.com/aicoe-aiops/mailing-list-analysis-toolkit/issues/new?assignees=&labels=enhancement&template=feature_request.md) if you would like an additional analysis added) diff --git a/docs/contribute.md b/docs/contribute.md new file mode 100644 index 0000000..922e1ef --- /dev/null +++ b/docs/contribute.md @@ -0,0 +1,15 @@ +# Contribute to the Mailing List Analysis Toolkit + +To familiarize yourself with the Mailing List Analysis Toolkit, check out how to [Get Started](getting-started.md). + +## Add an Analysis + +One of the key benefits of this approach is that it allows the bulk of the development to be done directly in a Jupyter notebook and makes adding new analyses or preprocessing steps as simple as adding a new notebook to the repository. + +For example, in order to add an additional analysis to the application, one just needs to [submit a PR](https://github.com/aicoe-aiops/mailing-list-analysis-toolkit/pulls) with a new notebook in `notebooks/02_analyses` and a small update to `manifests/wftmpl.yaml` to include the new notebook in the workflow. + +You can also reach out to aicoe-aiops@redhat.com with any questions. + +## Automation and Workflow Configurations + +Please see the README at [manifests/README.md](../manifests/README.md) for complete details on how to define the automation and application workflow via Argo. By following the guide, one can automate their application and Jupyter notebooks using Argo CD. diff --git a/docs/getting-started.md b/docs/getting-started.md new file mode 100644 index 0000000..5cfe4b0 --- /dev/null +++ b/docs/getting-started.md @@ -0,0 +1,39 @@ +# Get Started + +This project contains examples of how to perform end-to-end analysis on mailing lists. Check out the project [README](../README.md) for an overview of this project. + +## Run the Notebooks + +There are interactive notebooks for this [project](https://github.com/aicoe-aiops/mailing-list-analysis-toolkit) available for anyone to start using on the public [JupyterHub](https://jupyterhub-opf-jupyterhub.apps.zero.massopen.cloud/hub/login) instance on the [Massachusetts Open Cloud](https://massopen.cloud/) (MOC) right now! + +1. To get started, access [JupyterHub](https://jupyterhub-opf-jupyterhub.apps.zero.massopen.cloud/), select log in with `moc-sso`, and sign in using your Google Account. +2. After signing in, on the spawner page, select the `Mailing list analysis toolkit` image from the dropdown in the JupyterHub Notebook Image section, choose a `Medium` container size, and hit `Start` to start your server. +3. Once your server has spawned, you should see a directory titled `mailing-list-analysis-toolkit-`. +4. 
Go to `notebooks/`. To look into the data collection and pre-processing steps, explore the notebooks in the `01_collect_data` directory; to explore examples of analyses on mailing lists, go through the notebooks in the `02_analyses` directory. +5. Here's a video that can help familiarize you with the project. + +`video: https://www.youtube.com/watch?v=arvpVoTXgZg` + +If you need more help navigating the Operate First environment, we have a few [short videos](https://www.youtube.com/playlist?list=PL8VBRDTElCWpneB4dBu4u1kHElZVWfAwW) to help you get started. + + +## Project Assumptions + +This repository assumes that you have an existing Open Data Hub deployed on OpenShift that includes JupyterHub, Argo, Ceph, Hive, Cloudera Hue, and Apache Superset. + +* Take a look at [opendatahub.io](https://www.opendatahub.io) for details on the Open Data Hub platform. +* Details of our existing public deployment can be found at [operate-first.cloud](https://www.operate-first.cloud/). + +### Dashboard + +The primary output and user interface for this application is a [Superset](https://superset.apache.org/) dashboard. This tool allows us to define certain data visualization elements from our analysis that we would like to publish and share with others, while also including enough flexibility and interactivity to allow users to explore the data themselves. + +Our application is designed to automatically re-run the analyses on a regular basis and ensure that the dashboard and all its plots are current and up to date. + +* The current [Superset Dashboard](https://superset.datahub.redhat.com/superset/dashboard/fedora_mail/) can be found here. It is currently accessible internally at Red Hat; however, there are also plans to make the analysis publicly accessible on Operate First (see this [issue](https://github.com/aicoe-aiops/mailing-list-analysis-toolkit/issues/67)). + +![](assets/images/fedora-dashboard.png) + +### Automated Argo workflows + +If you'd like to automate your Jupyter notebooks using Argo, please follow the steps outlined in this [guide](../manifests/README.md). By following the steps in the document, your application can be fully set up and ready to be deployed via Argo CD. diff --git a/docs/getting-started.rst b/docs/getting-started.rst deleted file mode 100644 index b4f71c3..0000000 --- a/docs/getting-started.rst +++ /dev/null @@ -1,6 +0,0 @@ -Getting started -=============== - -This is where you describe how to get set up on a clean install, including the -commands necessary to get the raw data (using the `sync_data_from_s3` command, -for example), and then how to make the cleaned, final data sets. diff --git a/docs/index.rst b/docs/index.rst deleted file mode 100644 index ee1dba1..0000000 --- a/docs/index.rst +++ /dev/null @@ -1,24 +0,0 @@ -.. project-template documentation master file, created by - sphinx-quickstart. - You can adapt this file completely to your liking, but it should at least - contain the root `toctree` directive. - -project-template documentation! -============================================== - -Contents: - -.. toctree:: - :maxdepth: 2 - - getting-started - commands - - - -Indices and tables -================== - -* :ref:`genindex` -* :ref:`modindex` -* :ref:`search` diff --git a/manifests/README.md b/manifests/README.md index c072035..8588332 100644 --- a/manifests/README.md +++ b/manifests/README.md @@ -1,6 +1,6 @@ # Automated Argo workflows -If you'd like to automate your Jupyter notebooks using Argo, please use these kustomize manifests. 
If you follow the steps bellow, your application is fully set and ready to be deployed via Argo CD. +If you'd like to automate your Jupyter notebooks using Argo, please use these kustomize manifests. If you follow the steps below, your application will be fully set up and ready to be deployed via Argo CD. 

For a detailed guide on how to adjust your notebooks etc, please consult [documentation](https://github.com/aicoe-aiops/data-science-workflows/blob/master/Automating%20via%20Argo.md)
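
The application overview above describes notebooks that push their outputs to Ceph remote storage while also keeping a local copy on a volume shared between the workflow's pods. As a rough illustration of that pattern, here is a minimal sketch of what the final cell of a collection or analysis notebook might look like. It assumes S3-compatible access to Ceph via `boto3`; the environment variable names, bucket layout, and local paths are hypothetical and will differ from the repository's actual helper code.

```python
# Minimal sketch (not the repository's actual code): persist a notebook's
# monthly output both locally, for downstream notebooks that share the
# volume, and to Ceph remote storage for reuse in future workflow runs.
# The environment variable names and paths below are assumptions.
import os
from pathlib import Path

import boto3
import pandas as pd

# Example output of an analysis step: one small CSV per month.
df = pd.DataFrame({"keyword": ["kernel", "rpm"], "count": [42, 17]})
month = "2021-08"

# 1) Keep a local copy on the shared volume so later notebooks can read it.
local_dir = Path("data/processed/keywords")  # hypothetical location
local_dir.mkdir(parents=True, exist_ok=True)
local_path = local_dir / f"{month}.csv"
df.to_csv(local_path, index=False)

# 2) Push the same file to Ceph through its S3-compatible API.
s3 = boto3.client(
    "s3",
    endpoint_url=os.environ["S3_ENDPOINT_URL"],
    aws_access_key_id=os.environ["S3_ACCESS_KEY"],
    aws_secret_access_key=os.environ["S3_SECRET_KEY"],
)
s3.upload_file(str(local_path), os.environ["S3_BUCKET"], f"keywords/{month}.csv")
```

A new analysis notebook that follows this pattern would then only need a small addition to the Argo workflow template (`manifests/wftmpl.yaml`) so that it runs after the data collection notebooks, as described in docs/contribute.md.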