Content

The project repository for the mailing list analysis toolkit contains example code showing how to develop a custom end-to-end email analytics service using Open Data Hub on OpenShift.

Here's a video that walks through the project and demonstrates the automated dashboard.

Current Lists/Datasets

Application Overview

At a high level, this application can be seen as an Argo Workflow that orchestrates a set of Jupyter notebooks to push transformed data to Ceph. Each notebook is responsible for a single task: collecting raw data from the Fedora HyperKitty mailing list archive (our live data set), preprocessing that data, or performing a specific analytics task. In almost all cases, a notebook both pushes its outputs to Ceph remote storage (for use in a future run) and maintains a local copy in a volume shared among the application's pods for use by other notebook processes. Finally, we use external tables in Apache Hive, together with Hue, to connect the Ceph data to a SQL database for interactive visualization with Superset.
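
To make the orchestration pattern concrete, here is a minimal sketch of what a notebook-running WorkflowTemplate could look like. It is illustrative only: the template name, image, papermill entrypoint, secret name, mount path, and storage settings are assumptions, not values taken from the repository's actual wftmpl.yaml.

```yaml
# Illustrative WorkflowTemplate skeleton (not the repository's actual wftmpl.yaml):
# one generic template that executes a notebook, plus a workflow-scoped volume
# shared by all notebook pods.
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: mailing-list-analysis            # hypothetical name
spec:
  entrypoint: main
  volumeClaimTemplates:
    - metadata:
        name: workdir                     # shared local copies for all steps
      spec:
        accessModes: ["ReadWriteMany"]    # so parallel pods can mount it
        resources:
          requests:
            storage: 5Gi                  # assumed size
  templates:
    - name: run-notebook
      inputs:
        parameters:
          - name: notebook                # e.g. "01_collect_data/collect_data"
      container:
        image: quay.io/example/notebook-runner:latest   # hypothetical image
        command: [papermill]
        args:
          - "notebooks/{{inputs.parameters.notebook}}.ipynb"
          - "/tmp/out.ipynb"              # executed notebook copy; data artifacts
                                          # go to /mnt/data and Ceph instead
        volumeMounts:
          - name: workdir
            mountPath: /mnt/data          # shared volume described above
        envFrom:
          - secretRef:
              name: ceph-s3-credentials   # hypothetical secret with Ceph/S3 keys
```

In this sketch, the shared workdir volume is what provides the local copies passed between notebooks, while the credentials injected from the (hypothetical) ceph-s3-credentials secret are what each notebook would use to push its outputs to Ceph.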

Here is a guide that outlines the steps needed to automate your Jupyter notebooks using Argo. By following the steps in the document, your application can be fully set up and ready to deploy via Argo CD.

Notebooks

Currently, notebooks are divided into two sub-directories, notebooks/01_collect_data and notebooks/02_analyses, depending on which stage of the Argo workflow they belong to. This makes it explicit where notebooks go in the Argo workflow dependency tree when defining it in the wftmpl.yaml manifest file. Ideally, the notebooks in notebooks/01_collect_data should not depend on each other (so they can run in parallel), and the notebooks in notebooks/02_analyses should likewise be independent of each other, depending only on the output of notebooks in notebooks/01_collect_data. This keeps the workflow and dependency structure clear during development, and we believe this architecture can easily be modified to accommodate more complex dependency requirements.
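
As a rough illustration of how this maps onto the wftmpl.yaml dependency tree, the DAG entrypoint could look something like the sketch below, continuing the WorkflowTemplate sketch above. The 02_analyses task and notebook names are placeholders, and the actual manifest in the repository may be structured differently.

```yaml
# Continuing the sketch above: a DAG entrypoint in which the 01_collect_data
# tasks have no dependencies on each other (so Argo runs them in parallel) and
# the 02_analyses tasks depend only on the collection stage.
- name: main
  dag:
    tasks:
      - name: collect-data
        template: run-notebook
        arguments:
          parameters:
            - name: notebook
              value: 01_collect_data/collect_data
      - name: download-dataset
        template: run-notebook
        arguments:
          parameters:
            - name: notebook
              value: 01_collect_data/download_dataset
      # "example-analysis" is a placeholder; each real 02_analyses notebook
      # would get a task like this one.
      - name: example-analysis
        template: run-notebook
        arguments:
          parameters:
            - name: notebook
              value: 02_analyses/example_analysis
        dependencies: [collect-data, download-dataset]
```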

  • 01_collect_data

    • collect_data - Download new data from source and push to remote storage
    • download_dataset - Download existing preprocessed data from remote storage
    • gz_to_raw - Convert downloaded *.gz files to raw mbox format
    • raw_to_meta - Process mbox files into monthly metadata *.csv and push to remote storage
    • raw_to_text - Process mbox files into monthly email body *.csv and push to remote storage
    • ? (please open an issue if you would like an additional data collection or preprocessing step added)
  • 02_analyses